WO2024073275A1 - Rendering interface for audio data in extended reality systems - Google Patents

Rendering interface for audio data in extended reality systems

Info

Publication number
WO2024073275A1
WO2024073275A1 (PCT/US2023/074583)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
visual
scene
audio element
descriptive information
Prior art date
Application number
PCT/US2023/074583
Other languages
English (en)
Inventor
Imed Bouazizi
Thomas Stockhammer
Isaac Garcia Munoz
Nikolai Konrad Leung
Andre Schevciw
Graham Bradley Davis
Original Assignee
Qualcomm Incorporated
Priority date
Filing date
Publication date
Priority claimed from US 18/467,869, published as US20240114312A1
Application filed by Qualcomm Incorporated filed Critical Qualcomm Incorporated
Publication of WO2024073275A1

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/54 Controlling the output signals based on the game progress involving acoustic signals, e.g. for simulating revolutions per minute [RPM] dependent engine sounds in a driving game or reverberation against a virtual wall
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F2300/8082 Virtual reality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field

Definitions

  • This disclosure relates to processing of audio data.
  • Computer-mediated reality systems are being developed to allow computing devices to augment or add to, remove or subtract from, or generally modify existing reality experienced by a user.
  • Computer-mediated reality systems (which may also be referred to as “extended reality systems,” or “XR systems”) may include, as examples, virtual reality (VR) systems, augmented reality (AR) systems, and mixed reality (MR) systems.
  • the perceived success of computer-mediated reality systems is generally related to the ability of such systems to provide a realistically immersive experience in terms of both the visual and audio experience, where the visual and audio experience align in ways expected by the user.
  • although the human visual system is more sensitive than the human auditory system (e.g., in terms of perceived localization of various objects within the scene), ensuring an adequate auditory experience is an increasingly important factor in ensuring a realistically immersive experience, particularly as the visual experience improves to permit better localization of visual objects, which enables the user to better identify sources of audio content.
  • This disclosure generally relates to techniques for providing a separate audio interface that facilitates rendering at the audio playback system.
  • the techniques may enable an audio playback system to synchronize playback of audio elements to playback of visual elements.
  • the audio playback system may include an interface (such as an application programming interface - API) that an audio system may expose in order to facilitate interactions with a scene manager that manages playback of one or more visual elements that support an extended reality (XR) scene.
  • the audio elements may not be captured at the same time as the visual elements, or may be added later (e.g., during an XR-mediated conference, such as an XR videoconference).
  • the audio playback system may invoke the scene manager to match one or more visual elements to one or more audio elements (e.g., by comparing a name or other unique identifier - UID - associated with each of the one or more visual elements and one or more audio elements).
  • the scene manager may modify audio metadata defining a pose (which may refer to a position and/or orientation) of the one or more audio elements to more closely correspond to the matching one or more visual elements.
  • the scene manager may then output the modified audio metadata to an audio unit of the audio playback system, which may render the audio elements to one or more speaker feeds.
  • the audio playback system may then output the one or more speaker feeds to one or more speakers (which may also be referred to as loudspeakers, headphone speakers, or more generally as transducers).
  • the techniques may improve operation of the audio playback system as the audio playback system may more accurately reproduce a soundfield (based on the one or more speaker feeds) to potentially improve the immersive experience of XR systems. That is, rather than render the audio elements based on low resolution audio metadata that may not match the corresponding visual element, the audio playback system may modify the audio metadata to more closely match the corresponding visual element, thereby increasing the immersion of the XR experience through higher resolution audio metadata. As such, various aspects of the techniques described in this disclosure may improve the audio playback system itself.
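The matching-and-update flow described above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation; the class names, fields, and the assumption that audio and visual elements share a UID namespace are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    position: tuple      # (x, y, z)
    orientation: tuple   # quaternion (w, x, y, z)

@dataclass
class Element:
    uid: str   # name or other unique identifier used for matching
    pose: Pose

def match_and_update(visual_elements, audio_elements):
    """Map audio elements to visual elements by UID, then snap each
    audio pose to its matching (higher-resolution) visual pose."""
    visual_by_uid = {v.uid: v for v in visual_elements}
    modified = []
    for a in audio_elements:
        v = visual_by_uid.get(a.uid)
        if v is not None:
            # Modify the audio metadata to more closely correspond
            # to the matching visual element.
            a = Element(a.uid, v.pose)
        modified.append(a)
    return modified
```

The modified elements would then be handed to the audio unit for rendering to speaker feeds; unmatched audio elements keep their original (possibly lower-resolution) metadata.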
  • the techniques are directed to a device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene; and processing circuitry coupled to the memory and configured to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • the techniques are directed to a method of processing at least one audio element, the method comprising: mapping, based on visual metadata associated with at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; modifying, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; rendering, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting the one or more speaker feeds.
  • the techniques are directed to a non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • the techniques are directed to a device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene; and processing circuitry coupled to the memory and configured to execute a scene manager and an audio unit, wherein the scene manager is configured to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; and register, with the audio unit, a callback by which the audio unit is configured to request the modified audio metadata prior to rendering the at least one audio element, and wherein the audio unit is configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • the techniques are directed to a method of processing at least one audio element, the method comprising: mapping, by a scene manager executed by processing circuitry and based on visual metadata associated with at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; modifying, by the scene manager and based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; rendering, by an audio unit executed by the processing circuitry and based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting, by the audio unit, the one or more speaker feeds.
  • the techniques are directed to a non-transitory computer- readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: execute a scene manager configured to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; and modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; and execute an audio unit configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
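In the callback variant above, the audio unit pulls the modified metadata from the scene manager just before rendering rather than having it pushed. A minimal sketch under assumed names (`register_metadata_callback`, `get_metadata`, and the dictionary-based store are all illustrative, not the disclosed API):

```python
class SceneManager:
    """Manages visual playback and holds the modified audio metadata."""
    def __init__(self):
        self._modified_metadata = {}

    def update_metadata(self, uid, pose):
        # Called whenever mapping/modification produces a new pose.
        self._modified_metadata[uid] = pose

    def get_metadata(self, uid):
        return self._modified_metadata.get(uid)

class AudioUnit:
    """Renders audio elements; requests metadata via a registered callback."""
    def __init__(self):
        self._metadata_callback = None

    def register_metadata_callback(self, callback):
        # The scene manager registers the callback by which the audio
        # unit requests modified audio metadata prior to rendering.
        self._metadata_callback = callback

    def render(self, audio_element_uid):
        pose = None
        if self._metadata_callback is not None:
            pose = self._metadata_callback(audio_element_uid)
        # ...rendering of the audio element at `pose` to one or more
        # speaker feeds would happen here...
        return pose

manager = SceneManager()
unit = AudioUnit()
unit.register_metadata_callback(manager.get_metadata)
manager.update_metadata("door", (1.0, 0.0, 2.0))
```

A pull-style callback like this keeps the audio unit rendering against the freshest pose the scene manager has, which matters when visual elements move between audio render calls.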
  • the techniques are directed to a device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene; and processing circuitry coupled to the memory and configured to execute a scene manager, an audio processing unit, and an audio unit, wherein the scene manager is configured to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; and configure the audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata, wherein the audio processing unit is configured to: replace, based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata; and output the audio bitstream.
  • the techniques are directed to a method of processing at least one audio element, the method comprising: mapping, by a scene manager executed by processing circuitry and based on visual metadata associated with at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; modifying, by the scene manager and based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata; configuring, by the scene manager, an audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata; replacing, by the audio processing unit and based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata; outputting, by the audio processing unit, the audio bitstream to an audio unit executed by the processing circuitry; rendering, by the audio unit and based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting, by the audio unit, the one or more speaker feeds.
  • the techniques are directed to a non-transitory computer- readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: execute a scene manager configured to map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element, and modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata, and configure an audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata; execute the audio processing unit to replace, based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata, and output the audio bitstream to the audio unit; and execute an audio unit configured to render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds, and output the one or more speaker feeds.
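Schematically, the bitstream-rewrite variant has the audio processing unit replace per-element metadata in the bitstream before the audio unit ever sees it. The dictionary-based "packets" below are a toy container, not the actual bitstream syntax:

```python
def replace_metadata(bitstream_packets, modified_metadata):
    """Replace per-element audio metadata in a (toy) bitstream with
    modified metadata, leaving unmatched elements untouched.

    bitstream_packets: list of dicts with "uid" and "pose" keys
    modified_metadata: mapping from uid to the modified pose
    """
    out = []
    for packet in bitstream_packets:
        uid = packet["uid"]
        if uid in modified_metadata:
            # Rewrite only the metadata; the coded audio payload
            # (not modeled here) would pass through unchanged.
            packet = {**packet, "pose": modified_metadata[uid]}
        out.append(packet)
    return out
```

The rewritten bitstream is then output to the audio unit, which renders against the metadata now embedded in the stream rather than querying the scene manager directly.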
  • the techniques are directed to a device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene; and processing circuitry coupled to the memory and configured to execute a scene manager and an audio unit, wherein the scene manager is configured to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; construct, based on the mapping of the at least one visual element to the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; and modify, based on the scene graph, the audio metadata to obtain modified audio metadata, and wherein the audio unit is configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • the techniques are directed to a method of processing at least one audio element, the method comprising: mapping, by a scene manager executed by processing circuitry and based on visual metadata associated with at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; constructing, by the scene manager and based on the mapping of the at least one visual element to the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; modifying, by the scene manager and based on the scene graph, the audio metadata to obtain modified audio metadata; rendering, by an audio unit executed by the processing circuitry and based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting, by the audio unit, the one or more speaker feeds.
  • the techniques are directed to a non-transitory computer- readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: execute a scene manager configured to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element; construct, by the scene manager and based on the mapping of the at least one visual element to the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; and modify, by the scene manager and based on the scene graph, the audio metadata to obtain modified audio metadata; and execute an audio unit configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
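The parent/child relationship in the scene-graph variant might be represented as below. This is a hypothetical structure (node class, translation-only poses, and name-keyed output are all assumptions made for illustration); the point is that an audio child node inherits the pose of its visual parent:

```python
class Node:
    """A scene-graph node; `pose` is a position relative to the parent."""
    def __init__(self, name, pose=(0.0, 0.0, 0.0)):
        self.name = name
        self.pose = pose
        self.children = []

    def attach(self, child):
        self.children.append(child)
        return child

def world_poses(node, parent=(0.0, 0.0, 0.0), out=None):
    """Compose each node's local pose with its parent's world pose,
    so an audio child node follows its visual parent node."""
    if out is None:
        out = {}
    wp = tuple(p + q for p, q in zip(parent, node.pose))
    out[node.name] = wp
    for child in node.children:
        world_poses(child, wp, out)
    return out

root = Node("root")
avatar = root.attach(Node("avatar", pose=(2.0, 0.0, 1.0)))  # visual parent
avatar.attach(Node("avatar_voice"))                         # audio child
poses = world_poses(root)
```

With this dependency, moving the visual parent automatically moves the attached sound source, which is one way a scene graph can keep the audio metadata aligned with the visual element.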
  • the techniques are directed to a device configured to process a bitstream, the device comprising: a memory configured to store the bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element; and processing circuitry coupled to the memory and configured to execute a scene manager and an audio unit, wherein the scene manager is configured to: construct, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element; and modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information, and wherein the audio unit is configured to: render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • the techniques are directed to a method comprising: obtaining a bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element; constructing, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element; modifying, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information; rendering, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and outputting the one or more speaker feeds.
  • the techniques are directed to a non-transitory computer-readable medium having stored thereon instructions that, when executed, cause processing circuitry to: obtain a bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element; construct, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element; modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information; render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • FIGS. 1A and 1B are diagrams illustrating systems that may perform various aspects of the techniques described in this disclosure.
  • FIGS. 2-4 are block diagrams illustrating example architectures of the playback system shown in the example of FIGS. 1A and/or 1B for performing various aspects of the audio rendering techniques described in this disclosure.
  • FIGS. 5A and 5B are diagrams illustrating examples of XR devices.
  • FIG. 6 illustrates an example of a wireless communications system that supports audio streaming in accordance with aspects of the present disclosure.
  • FIG. 7 is a block diagram illustrating an example architecture of the playback system shown in the example of FIGS. 1A and/or 1B for performing various aspects of the audio rendering techniques described in this disclosure.
  • FIG. 8 is a diagram illustrating an example of the audio playback system in performing a graph update in accordance with various aspects of the techniques described in this disclosure.
  • FIG. 9 is a diagram illustrating example listener space descriptor file (LSDF) alignment according to various aspects of the techniques described in this disclosure.
  • FIG. 10 is a flowchart illustrating example operation of the content consumer device of FIG. 1 in performing various aspects of the techniques described in this disclosure.
  • FIG. 11 is a table illustrating example functions provided by an application programming interface exposed by the audio unit shown in FIG. 7 in accordance with various aspects of the techniques described in this disclosure.
  • Example formats include channel-based audio formats, object-based audio formats, and scene-based audio formats.
  • Channel-based audio formats refer to the 5.1 surround sound format, 7.1 surround sound formats, 22.2 surround sound formats, or any other channel-based format that localizes audio channels to particular locations around the listener in order to recreate a soundfield.
  • Object-based audio formats may refer to formats in which audio objects, often encoded using pulse-code modulation (PCM) and referred to as PCM audio objects, are specified in order to represent the soundfield.
  • Such audio objects may include metadata identifying a location of the audio object relative to a listener or other point of reference in the soundfield, such that the audio object may be rendered to one or more speaker channels for playback in an effort to recreate the soundfield.
  • the techniques described in this disclosure may apply to any of the foregoing formats, including scene-based audio formats, channel-based audio formats, object-based audio formats, or any combination thereof.
  • Scene-based audio formats may include a hierarchical set of elements that define the soundfield in three dimensions.
  • One example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC).
  • the following expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ of the soundfield, at time $t$, can be represented uniquely by the SHC, $A_n^m(k)$:

    $$p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty}\left[4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r)\right] e^{j\omega t}$$

  • here, $k = \omega/c$, $c$ is the speed of sound (~343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order $n$, and $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions (which may also be referred to as spherical basis functions) of order $n$ and suborder $m$.
  • the term in square brackets is a frequency-domain representation of the signal (i.e., $S(\omega, r_r, \theta_r, \varphi_r)$), which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform.
  • Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
  • the SHC $A_n^m(k)$ can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, derived from channel-based or object-based descriptions of the soundfield.
  • the SHC (which may also be referred to as ambisonic coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage.
  • for example, a fourth-order representation involving $(1+4)^2 = 25$ coefficients may be used.
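The coefficient count grows quadratically with ambisonic order: an order-$N$ representation uses $(N+1)^2$ coefficients, matching the fourth-order example above. A trivial sketch:

```python
def ambisonic_coefficient_count(order):
    """Number of SHC for a given ambisonic order N: (N + 1)**2,
    i.e. one coefficient per (n, m) pair with 0 <= n <= N, -n <= m <= n."""
    return (order + 1) ** 2

# Order 0 (omnidirectional only) through order 4 (the example above):
counts = {n: ambisonic_coefficient_count(n) for n in range(5)}
```

This is why higher-order representations are more expensive to transmit but enable more accurate localization of sound sources.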
  • the SHC may be derived from a microphone recording using a microphone array.
  • Various examples of how SHC may be physically acquired from microphone arrays are described in Poletti, M., “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., Vol. 53, No. 11, 2005 November, pp. 1004-1025.
  • to illustrate, the coefficients $A_n^m(k)$ for the soundfield corresponding to an individual audio object may be expressed as:

    $$A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s)$$

    where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order $n$, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the object source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the pulse code modulated (PCM) stream) may enable conversion of each PCM object and the corresponding location into the SHC $A_n^m(k)$.
  • the $A_n^m(k)$ coefficients for each object are additive.
  • a number of PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects).
  • the coefficients may contain information about the soundfield (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall soundfield, in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$.
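Because the coefficients are additive, combining several PCM objects into one soundfield reduces to an element-wise sum of their coefficient vectors. A dependency-free sketch (the flat per-object lists stand in for the $(N+1)^2$-length SHC vectors):

```python
def sum_soundfields(object_coefficients):
    """Sum per-object SHC vectors element-wise to obtain the SHC of
    the overall soundfield near the observation point.

    object_coefficients: iterable of equal-length coefficient lists,
    one list per audio object (values may be real or complex).
    """
    total = None
    for coeffs in object_coefficients:
        if total is None:
            total = list(coeffs)
        else:
            # Additivity: the combined soundfield is the element-wise
            # sum of the individual objects' coefficient vectors.
            total = [t + c for t, c in zip(total, coeffs)]
    return total
```

Summing in the coefficient domain means the renderer only ever deals with one soundfield representation regardless of how many objects contributed.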
  • Computer-mediated reality systems (which may also be referred to as “extended reality systems,” or “XR systems”) are being developed to take advantage of many of the potential benefits provided by ambisonic coefficients.
  • ambisonic coefficients may represent a soundfield in three dimensions in a manner that potentially enables accurate three-dimensional (3D) localization of sound sources within the soundfield.
  • XR devices may render the ambisonic coefficients to speaker feeds that, when played via one or more speakers, accurately reproduce the soundfield.
  • ambisonic coefficients for XR may enable development of a number of use cases that rely on the more immersive soundfields provided by the ambisonic coefficients, particularly for computer gaming applications and live visual streaming applications.
  • the XR devices may prefer ambisonic coefficients over other representations that are more difficult to manipulate or involve complex rendering. More information regarding these use cases is provided below with respect to FIGS. 1A and 1B.
  • the mobile device (such as a so-called smartphone) may present the displayed world via a screen, which may be mounted to the head of the user 102 or viewed as would be done when normally using the mobile device.
  • any information on the screen can be part of the mobile device.
  • the mobile device may be able to provide tracking information and thereby allow for both a VR experience (when head mounted) and a normal experience to view the displayed world, where the normal experience may still allow the user to view the displayed world, providing a VR-lite-type experience (e.g., holding up the device and rotating or translating the device to view different portions of the displayed world).
  • FIGS. 1A and 1B are diagrams illustrating systems that may perform various aspects of the techniques described in this disclosure.
  • system 10 includes a source device 12 and a content consumer device 14. While described in the context of the source device 12 and the content consumer device 14, the techniques may be implemented in any context in which any hierarchical representation of a soundfield is encoded to form a bitstream representative of the audio data.
  • the source device 12 may represent any form of computing device capable of generating hierarchical representation of a soundfield, and is generally described herein in the context of being a VR content creator device.
  • the content consumer device 14 may represent any form of computing device capable of implementing the audio stream interpolation techniques described in this disclosure as well as audio playback, and is generally described herein in the context of being a VR client device.
  • the source device 12 may be operated by an entertainment company or other entity that may generate multi-channel audio content for consumption by operators of content consumer devices, such as the content consumer device 14. In many VR scenarios, the source device 12 generates audio content in conjunction with visual content.
  • the source device 12 includes a content capture device 300 and a soundfield representation generator 302.
  • the content capture device 300 may be configured to interface or otherwise communicate with one or more microphones 5A-5N (“microphones 5”).
  • the microphones 5 may represent an Eigenmike® or other type of 3D audio microphone capable of capturing and representing the soundfield as corresponding scene-based audio data 11A-11N (which may also be referred to as ambisonic coefficients 11A-11N or “ambisonic coefficients 11”).
  • each of the microphones 5 may represent a cluster of microphones arranged within a single housing according to set geometries that facilitate generation of the ambisonic coefficients 11.
  • the term microphone may refer to a cluster of microphones (which are actually geometrically arranged transducers) or a single microphone (which may be referred to as a spot microphone or spot transducer).
  • the ambisonic coefficients 11 may represent one example of an audio stream. As such, the ambisonic coefficients 11 may also be referred to as audio streams 11. Although described primarily with respect to the ambisonic coefficients 11, the techniques may be performed with respect to other types of audio streams, including pulse code modulated (PCM) audio streams, channel -based audio streams, object-based audio streams, etc.
  • the content capture device 300 may, in some examples, include an integrated microphone that is integrated into the housing of the content capture device 300.
  • the content capture device 300 may interface wirelessly or via a wired connection with the microphones 5.
  • the content capture device 300 may process the ambisonic coefficients 11 after the ambisonic coefficients 11 are input via some type of removable storage, wirelessly, and/or via wired input processes, or alternatively or in conjunction with the foregoing, generated or otherwise created (from stored sound samples, such as is common in gaming applications, etc.).
  • various combinations of the content capture device 300 and the microphones 5 are possible.
  • the content capture device 300 may also be configured to interface or otherwise communicate with the soundfield representation generator 302.
  • the soundfield representation generator 302 may include any type of hardware device capable of interfacing with the content capture device 300.
  • the soundfield representation generator 302 may use the ambisonic coefficients 11 provided by the content capture device 300 to generate various representations of the same soundfield represented by the ambisonic coefficients 11.
  • the soundfield representation generator 24 may use a coding scheme for ambisonic representations of a soundfield, referred to as Mixed Order Ambisonics (MOA) as discussed in more detail in U.S. Application Serial No. 15/672,058, entitled “MIXED-ORDER AMBISONICS (MOA) AUDIO DATA FOR COMPUTER-MEDIATED REALITY SYSTEMS,” filed August 8, 2017, and published as U.S. patent publication no. 20190007781 on January 3, 2019.
  • the soundfield representation generator 24 may generate a partial subset of the full set of ambisonic coefficients (where the term “subset” is used not in the strict mathematical sense to include zero or more, if not all, of the full set, but instead may refer to one or more, but not all of the full set).
  • each MOA representation generated by the soundfield representation generator 24 may provide precision with respect to some areas of the soundfield, but less precision in other areas.
  • an MOA representation of the soundfield may include eight (8) uncompressed ambisonic coefficients, while the third order ambisonic representation of the same soundfield may include sixteen (16) uncompressed ambisonic coefficients.
  • each MOA representation of the soundfield that is generated as a partial subset of the ambisonic coefficients may be less storage-intensive and less bandwidth intensive (if and when transmitted as part of the bitstream 27 over the illustrated transmission channel) than the corresponding third order ambisonic representation of the same soundfield generated from the ambisonic coefficients.
  • the techniques of this disclosure may also be performed with respect to first-order ambisonic (FOA) representations in which all of the ambisonic coefficients associated with a first order spherical basis function and a zero order spherical basis function are used to represent the soundfield.
  • the soundfield representation generator 302 may represent the soundfield using all of the ambisonic coefficients for a given order N, resulting in a total number of ambisonic coefficients equal to (N+1)^2.
  • the ambisonic audio data may include ambisonic coefficients associated with spherical basis functions having an order of one or less (which may be referred to as “1st order ambisonic audio data”), ambisonic coefficients associated with spherical basis functions having a mixed order and suborder (which may be referred to as the “MOA representation” discussed above), or ambisonic coefficients associated with spherical basis functions having an order greater than one (which is referred to above as the “full order representation”).
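  • The coefficient counts above can be sketched as follows. The helper names are hypothetical, and the mixed-order count assumes the common convention of keeping the two horizontal-only (m = ±n) components for each order above the vertical order, which reproduces the 8-versus-16 example given earlier:

```python
def full_order_count(n):
    # A full spherical-harmonic expansion of order N has (N + 1)^2 coefficients.
    return (n + 1) ** 2

def moa_count(horizontal_order, vertical_order):
    # Mixed order: full expansion up to the vertical order, plus the two
    # horizontal-only (m = +/-n) components for each higher order.
    assert horizontal_order >= vertical_order
    return full_order_count(vertical_order) + 2 * (horizontal_order - vertical_order)

print(full_order_count(1))  # first order (FOA): 4 coefficients
print(full_order_count(3))  # third order: 16 coefficients
print(moa_count(3, 1))      # mixed third-horizontal/first-vertical order: 8
```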
  • the content capture device 300 may, in some examples, be configured to wirelessly communicate with the soundfield representation generator 302. In some examples, the content capture device 300 may communicate, via one or both of a wireless connection or a wired connection, with the soundfield representation generator 302. Via the connection between the content capture device 300 and the soundfield representation generator 302, the content capture device 300 may provide content in various forms of content, which, for purposes of discussion, are described herein as being portions of the ambisonic coefficients 11.
  • the content capture device 300 may leverage various aspects of the soundfield representation generator 302 (in terms of hardware or software capabilities of the soundfield representation generator 302).
  • the soundfield representation generator 302 may include dedicated hardware configured to (or specialized software that when executed causes one or more processors to) perform psychoacoustic audio encoding (such as a unified speech and audio coder denoted as “USAC” set forth by the Moving Picture Experts Group (MPEG), the MPEG-H 3D audio coding standard, the MPEG-I Immersive Audio standard, or proprietary standards, such as AptX™ (including various versions of AptX such as enhanced AptX - E-AptX, AptX live, AptX stereo, and AptX high definition - AptX-HD), advanced audio coding (AAC), Audio Codec 3 (AC-3), Apple Lossless Audio Codec (ALAC), MPEG-4 Audio Lossless Streaming (ALS), enhanced AC-3, Free Lossless Audio Codec (FLAC), Monkey’s Audio, and the like).
  • the content capture device 300 may not include the psychoacoustic audio encoder dedicated hardware or specialized software and instead provide audio aspects of the content 301 in a non-psychoacoustic audio coded form.
  • the soundfield representation generator 302 may assist in the capture of content 301 by, at least in part, performing psychoacoustic audio encoding with respect to the audio aspects of the content 301.
  • the soundfield representation generator 302 may also assist in content capture and transmission by generating one or more bitstreams 21 based, at least in part, on the audio content (e.g., MOA representations, third order ambisonic representations, and/or first order ambisonic representations) generated from the ambisonic coefficients 11.
  • the bitstream 21 may represent a compressed version of the ambisonic coefficients 11 (and/or the partial subsets thereof used to form MOA representations of the soundfield) and any other different types of the content 301 (such as a compressed version of spherical visual data, image data, or text data).
  • the soundfield representation generator 302 may generate the bitstream 21 for transmission, as one example, across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like.
  • the bitstream 21 may represent an encoded version of the ambisonic coefficients 11 (and/or the partial subsets thereof used to form MOA representations of the soundfield) and may include a primary bitstream and another side bitstream, which may be referred to as side channel information.
  • the bitstream 21 representing the compressed version of the ambisonic coefficients 11 may conform to bitstreams produced in accordance with the MPEG-H 3D audio coding standard and/or an MPEG-I standard for “Coded Representations of Immersive Media.”
  • the content consumer device 14 may be operated by an individual, and may represent a VR client device. Although described with respect to a VR client device, content consumer device 14 may represent other types of devices, such as an augmented reality (AR) client device, a mixed reality (MR) client device (or any other type of head-mounted display device or extended reality - XR - device), a standard computer, a headset, headphones, or any other device capable of tracking head movements and/or general translational movements of the individual operating the content consumer device 14, as shown in the example of FIG. 1A.
  • the content consumer device 14 includes an audio playback system 16A, which may refer to any form of audio playback system capable of rendering ambisonic coefficients (whether in form of first order, second order, and/or third order ambisonic representations and/or MOA representations) for playback as multi-channel audio content.
  • the content consumer device 14 may retrieve the bitstream 21 directly from the source device 12.
  • the content consumer device 14 may interface with a network, including a fifth generation (5G) cellular network, to retrieve the bitstream 21 or otherwise cause the source device 12 to transmit the bitstream 21 to the content consumer device 14.
  • the source device 12 may output the bitstream 21 to an intermediate device positioned between the source device 12 and the content consumer device 14.
  • the intermediate device may store the bitstream 21 for later delivery to the content consumer device 14, which may request the bitstream.
  • the intermediate device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smart phone, or any other device capable of storing the bitstream 21 for later retrieval by an audio decoder.
  • the intermediate device may reside in a content delivery network capable of streaming the bitstream 21 (and possibly in conjunction with transmitting a corresponding visual data bitstream) to subscribers, such as the content consumer device 14, requesting the bitstream 21.
  • the source device 12 may store the bitstream 21 to a storage medium, such as a compact disc, a digital visual disc, a high definition visual disc or other storage media, most of which are capable of being read by a computer and therefore may be referred to as computer-readable storage media or non-transitory computer-readable storage media.
  • the transmission channel may refer to the channels by which content stored to the media is transmitted (and may include retail stores and other store-based delivery mechanisms). In any event, the techniques of this disclosure should not be limited in this respect to the example of FIG. 1A.
  • the content consumer device 14 includes the audio playback system 16.
  • the audio playback system 16 may represent any system capable of playing back multi-channel audio data.
  • the audio playback system 16A may include a number of different audio renderers 22.
  • the renderers 22 may each provide for a different form of audio rendering, where the different forms of rendering may include one or more of the various ways of performing vector-base amplitude panning (VBAP), and/or one or more of the various ways of performing soundfield synthesis.
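  • As a rough illustration of pairwise VBAP in the horizontal plane (a sketch of the technique, not the renderers 22 themselves; `vbap_2d` is a hypothetical name), the gains for the active loudspeaker pair may be solved from the source direction and then power-normalized:

```python
import math

def vbap_2d(source_az, spk1_az, spk2_az):
    # Solve p = g1 * l1 + g2 * l2 for the gains of the active speaker pair
    # (p, l1, l2 are unit direction vectors), then normalize so that
    # g1^2 + g2^2 = 1 for roughly constant perceived power.
    p = (math.cos(source_az), math.sin(source_az))
    l1 = (math.cos(spk1_az), math.sin(spk1_az))
    l2 = (math.cos(spk2_az), math.sin(spk2_az))
    det = l1[0] * l2[1] - l1[1] * l2[0]
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm
```

  A source aligned with one loudspeaker yields a gain of one for that loudspeaker and zero for the other; a source midway between the pair yields equal gains.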
  • A and/or B means “A or B”, or both “A and B”.
  • the audio playback system 16A may further include an audio decoding device 24.
  • the audio decoding device 24 may represent a device configured to decode bitstream 21 to output reconstructed ambisonic coefficients 11A’-11N’ (which may form the full first, second, and/or third order ambisonic representation or a subset thereof that forms an MOA representation of the same soundfield or decompositions thereof, such as the predominant audio signal, ambient ambisonic coefficients, and the vector based signal described in the MPEG-H 3D Audio Coding Standard and/or the MPEG-I Immersive Audio standard).
  • the ambisonic coefficients 11A’-11N’ (“ambisonic coefficients 11”’) may be similar to a full set or a partial subset of the ambisonic coefficients 11, but may differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel.
  • the audio playback system 16 may, after decoding the bitstream 21 to obtain the ambisonic coefficients 11’, obtain ambisonic audio data 15 from the different streams of ambisonic coefficients 11’, and render the ambisonic audio data 15 to output speaker feeds 25.
  • the speaker feeds 25 may drive one or more speakers (which are not shown in the example of FIG. 1A for ease of illustration purposes).
  • Ambisonic representations of a soundfield may be normalized in a number of ways, including N3D, SN3D, FuMa, N2D, or SN2D.
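  • Converting between two of the normalizations named above can be sketched as follows, assuming ACN channel ordering (the pairing commonly used with SN3D, as in the AmbiX convention); the function name is a hypothetical helper, not part of any standard interface:

```python
import math

def sn3d_to_n3d(coeffs):
    # ACN ordering: channel index acn = n*(n+1) + m, so the ambisonic order
    # of a channel is n = floor(sqrt(acn)); N3D = SN3D * sqrt(2n + 1).
    out = []
    for acn, c in enumerate(coeffs):
        n = math.isqrt(acn)
        out.append(c * math.sqrt(2 * n + 1))
    return out
```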
  • the audio playback system 16A may obtain loudspeaker information 13 indicative of a number of loudspeakers and/or a spatial geometry of the loudspeakers.
  • the audio playback system 16A may obtain the loudspeaker information 13 using a reference microphone and outputting a signal to activate (or, in other words, drive) the loudspeakers in such a manner as to dynamically determine, via the reference microphone, the loudspeaker information 13.
  • the audio playback system 16A may prompt a user to interface with the audio playback system 16A and input the loudspeaker information 13.
  • the audio playback system 16A may select one of the audio renderers 22 based on the loudspeaker information 13. In some instances, the audio playback system 16A may, when none of the audio renderers 22 are within some threshold similarity measure (in terms of the loudspeaker geometry) to the loudspeaker geometry specified in the loudspeaker information 13, generate the one of audio renderers 22 based on the loudspeaker information 13. The audio playback system 16A may, in some instances, generate one of the audio renderers 22 based on the loudspeaker information 13 without first attempting to select an existing one of the audio renderers 22.
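  • A minimal sketch of the threshold-based renderer selection described above, assuming each layout is a list of (azimuth, elevation) pairs in radians with loudspeakers in matched order; the names and the threshold value are illustrative only:

```python
import math

def angular_error(layout_a, layout_b):
    # Mean great-circle distance (radians) between matched loudspeaker
    # directions, each given as an (azimuth, elevation) pair in radians.
    total = 0.0
    for (az1, el1), (az2, el2) in zip(layout_a, layout_b):
        cos_d = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
        total += math.acos(max(-1.0, min(1.0, cos_d)))
    return total / len(layout_a)

def select_renderer(renderers, reported_layout, threshold=0.1):
    # Pick the preset renderer closest to the reported geometry, or signal
    # that a new renderer must be generated when none is similar enough.
    best = min(renderers, key=lambda r: angular_error(r["layout"], reported_layout))
    if angular_error(best["layout"], reported_layout) <= threshold:
        return best
    return None  # caller generates a renderer from the loudspeaker information
```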
  • the audio playback system 16A may utilize one of the renderers 22 that provides for binaural rendering using head-related transfer functions (HRTF) or other functions capable of rendering to left and right speaker feeds 25 for headphone speaker playback.
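  • Binaural rendering of a single audio element can be sketched as filtering the signal with a left/right head-related impulse response (HRIR) pair, the time-domain counterpart of an HRTF. The naive direct-form convolution below is illustrative only (a real renderer would use block FFT convolution), and the names are hypothetical:

```python
def convolve(signal, ir):
    # Direct-form FIR convolution of a mono signal with an impulse response.
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out

def binauralize(mono, hrir_left, hrir_right):
    # Filter the source with the HRIRs for its direction relative to the
    # listener's head to obtain the left and right speaker feeds.
    return convolve(mono, hrir_left), convolve(mono, hrir_right)
```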
  • the terms “speaker” or “transducer” may generally refer to any speaker, including loudspeakers, headphone speakers, etc. One or more speakers may then play back the rendered speaker feeds 25.
  • rendering of the speaker feeds 25 may refer to other types of rendering, such as rendering incorporated directly into the decoding of the ambisonic audio data 15 from the bitstream 21.
  • An example of the alternative rendering can be found in Annex G of the MPEG-H 3D audio coding standard, where rendering occurs during the predominant signal formulation and the background signal formation prior to composition of the soundfield.
  • reference to rendering of the ambisonic audio data 15 should be understood to refer to both rendering of the actual ambisonic audio data 15 or decompositions or representations thereof (such as the above-noted predominant audio signal, the ambient ambisonic coefficients, and/or the vector-based signal, which may also be referred to as a V-vector).
  • the content consumer device 14 may represent a VR device in which a human wearable display is mounted in front of the eyes of the user operating the VR device.
  • FIGS. 5A and 5B are diagrams illustrating examples of VR devices 400A and 400B.
  • the VR device 400A is coupled to, or otherwise includes, headphones 404, which may reproduce a soundfield represented by the ambisonic audio data 15 (which is another way to refer to ambisonic coefficients 15) through playback of the speaker feeds 25.
  • the speaker feeds 25 may represent an analog or digital signal capable of causing a membrane within the transducers of headphones 404 to vibrate at various frequencies. Such a process is commonly referred to as driving the headphones 404.
  • Visual, audio, and other sensory data may play important roles in the VR experience.
  • a user 402 may wear the VR device 400A (which may also be referred to as a VR headset 400A) or other wearable electronic device.
  • the VR client device (such as the VR headset 400A) may track head movement of the user 402, and adapt the visual data shown via the VR headset 400A to account for the head movements, providing an immersive experience in which the user 402 may experience a virtual world shown in the visual data in visual three dimensions.
  • VR and other forms of AR and/or MR devices may generally be referred to as computer-mediated reality devices.
  • the VR headset 400A may lack the capability to place the user in the virtual world audibly.
  • the VR system (which may include a computer responsible for rendering the visual data and audio data - that is not shown in the example of FIG. 5A for ease of illustration purposes, and the VR headset 400A) may be unable to support full three-dimensional immersion audibly.
  • FIG. 5B is a diagram illustrating an example of a wearable device 400B that may operate in accordance with various aspects of the techniques described in this disclosure.
  • the wearable device 400B may represent a VR headset (such as the VR headset 400A described above), an AR headset, an MR headset, or any other type of XR headset.
  • Augmented Reality “AR” may refer to computer rendered image or data that is overlaid over the real world where the user is actually located.
  • Mixed Reality “MR” may refer to computer rendered image or data that is world locked to a particular location in the real world, or may refer to a variant on VR in which part computer rendered 3D elements and part photographed real elements are combined into an immersive experience that simulates the user’s physical presence in the environment.
  • Extended Reality “XR” may represent a catchall term for VR, AR, and MR. More information regarding terminology for XR can be found in a document by Jason Peterson, entitled “Virtual Reality, Augmented Reality, and Mixed Reality Definitions,” and dated July 7, 2017.
  • the wearable device 400B may represent other types of devices, such as a watch (including so-called “smart watches”), glasses (including so-called “smart glasses”), headphones (including so-called “wireless headphones” and “smart headphones”), smart clothing, smart jewelry, and the like. Whether representative of a VR device, a watch, glasses, and/or headphones, the wearable device 400B may communicate with the computing device supporting the wearable device 400B via a wired connection or a wireless connection.
  • the computing device supporting the wearable device 400B may be integrated within the wearable device 400B and as such, the wearable device 400B may be considered as the same device as the computing device supporting the wearable device 400B. In other instances, the wearable device 400B may communicate with a separate computing device that may support the wearable device 400B. In this respect, the term “supporting” should not be understood to require a separate dedicated device but that one or more processors configured to perform various aspects of the techniques described in this disclosure may be integrated within the wearable device 400B or integrated within a computing device separate from the wearable device 400B.
  • the computing device supporting the wearable device 400B may be a separate dedicated computing device, such as a personal computer including the one or more processors.
  • the wearable device 400B may determine the translational head movement upon which the dedicated computing device may render, based on the translational head movement, the audio content (as the speaker feeds) in accordance with various aspects of the techniques described in this disclosure.
  • the wearable device 400B may include the one or more processors that both determine the translational head movement (by interfacing within one or more sensors of the wearable device 400B) and render, based on the determined translational head movement, the speaker feeds.
  • the wearable device 400B includes one or more directional speakers, and one or more tracking and/or recording cameras.
  • the wearable device 400B includes one or more inertial, haptic, and/or health sensors, one or more eye-tracking cameras, one or more high sensitivity audio microphones, and optics/projection hardware.
  • the optics/projection hardware of the wearable device 400B may include durable semi-transparent display technology and hardware.
  • the wearable device 400B also includes connectivity hardware, which may represent one or more network interfaces that support multimode connectivity, such as 4G communications, 5G communications, Bluetooth, etc.
  • the wearable device 400B also includes one or more ambient light sensors, and bone conduction transducers.
  • the wearable device 400B may also include one or more passive and/or active cameras with fisheye lenses and/or telephoto lenses.
  • the wearable device 400B also may include one or more light emitting diode (LED) lights.
  • the LED light(s) may be referred to as “ultra bright” LED light(s).
  • the wearable device 400B also may include one or more rear cameras in some implementations.
  • wearable device 400B may exhibit a variety of different form factors.
  • the tracking and recording cameras and other sensors may facilitate the determination of translational distance.
  • wearable device 400B may include other types of sensors for detecting translational distance.
  • a person of ordinary skill in the art would appreciate that descriptions related to FIGS. 1A-5B may apply to other examples of wearable devices, such as the VR device 400B discussed above with respect to the example of FIG. 5B and the other devices set forth in the examples of FIGS. 1A and 1B.
  • such descriptions may likewise apply to other wearable devices, such as smart glasses or a smart watch.
  • the techniques described in this disclosure should not be limited to a particular type of wearable device, but any wearable device may be configured to perform the techniques described in this disclosure.
  • the source device 12 further includes a camera 200.
  • the camera 200 may be configured to capture visual data, and provide the captured raw visual data to the content capture device 300.
  • the content capture device 300 may provide the visual data to another component of the source device 12, for further processing into viewport-divided portions.
  • the content consumer device 14 also includes the wearable device 800. It will be understood that, in various implementations, the wearable device 800 may be included in, or externally coupled to, the content consumer device 14. As discussed above with respect to FIGS. 5A and 5B, the wearable device 800 includes display hardware and speaker hardware for outputting visual data (e.g., as associated with various viewports) and for rendering audio data.
  • 3DOF refers to audio rendering that accounts for movement of the head in the three degrees of freedom (yaw, pitch, and roll), thereby allowing the user to freely look around in any direction.
  • 3DOF cannot account for translational head movements in which the head is not centered on the optical and acoustical center of the soundfield.
  • 3DOF plus provides for the three degrees of freedom (yaw, pitch, and roll) in addition to limited spatial translational movements due to the head movements away from the optical center and acoustical center within the soundfield.
  • 3DOF+ may provide support for perceptual effects such as motion parallax, which may strengthen the sense of immersion.
  • the third category, referred to as six degrees of freedom (6DOF), renders audio data in a manner that accounts for the three degrees of freedom in terms of head movements (yaw, pitch, and roll) but also accounts for translation of the user in space (x, y, and z translations).
  • the spatial translations may be induced by sensors tracking the location of the user in the physical world or by way of an input controller.
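  • The difference between 3DOF and 6DOF rendering can be illustrated with a minimal pose sketch: under 6DOF, the direction from the listener to an audio element depends on the listener's translation as well as orientation. Yaw-only orientation and the field/function names below are simplifying assumptions for illustration:

```python
import math
from dataclasses import dataclass

@dataclass
class Pose:
    # 6DOF pose: translation (x, y, z) plus orientation; yaw only here for
    # brevity, where a full implementation would carry yaw, pitch, and roll.
    x: float
    y: float
    z: float
    yaw: float

def relative_azimuth(listener, source_xyz):
    # Under 6DOF the direction to a source depends on listener translation,
    # which a rotation-only (3DOF) renderer cannot account for.
    dx = source_xyz[0] - listener.x
    dy = source_xyz[1] - listener.y
    return math.atan2(dy, dx) - listener.yaw
```

  For example, a source ahead and to the left appears to slide further leftward as the listener walks forward past it, which is the motion-parallax effect noted above.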
  • 3DOF rendering is the current state of the art for audio aspects of VR.
  • the audio aspects of VR are less immersive than the visual aspects, thereby potentially reducing the overall immersion experienced by the user, and introducing localization errors (e.g., such as when the auditory playback does not match or correlate exactly to the visual scene).
  • more immersive audio rendering, such as 3DOF+ and 6DOF rendering, may result in higher complexity in terms of processor cycles expended, memory and bandwidth consumed, etc.
  • rendering for 6DOF may require additional granularity in terms of pose (which may refer to position and/or orientation) that results in the higher complexity, while also complicating certain XR scenarios in terms of asynchronous capture of audio data and visual data.
  • XR scenes that involve live audio data capture (e.g., XR conferences, visual conferences, visual chat, metaverses, XR games, etc.) in which avatars (an example visual object) speak to one another using microphones to capture the live audio and convert such live audio into audio objects (which may also be referred to as audio elements as the audio objects are not necessarily defined in the object format).
  • 3DOF rendering may attempt to locate the audio elements into a general area of the corresponding avatar, providing loose colocation of audio elements to visual objects (which may also be referred to as visual elements). The lack of tighter colocation of audio elements relative to visual elements may reduce immersion and potentially result in difficulty interpreting the visual scene.
  • reference playback systems (which may be referred to as reference architectures) may be set forth in various standards, such as advanced coding (AC) fourth generation (AC-4), the MPEG-H 3D audio coding standard, the MPEG-I immersive coding standard, third generation partnership project (3GPP) standards, etc.
  • external renderers (which may refer to renderers not configured to operate in the context of audio decoding) may be useful, e.g., when rendering to a special class of devices that are not covered by built-in rendering (meaning rendering built in to the audio decoding system).
  • the signal in the bitstream should normally be decoded and presented to this external renderer, to avoid quality constraints compared to rendering in the audio decoder to an intermediate format that is then re-rendered by the external renderer.
  • the playback system 16 may include a separate audio unit 50 that provides a separate audio interface to facilitate rendering at the playback system 16.
  • the techniques may enable the playback system 16 to synchronize playback of audio elements to playback of visual elements (which may refer to AR/VR/XR video elements, video data element, and/or any element meant to be viewed).
  • the audio unit 50 may include an interface (such as an application programming interface - API) that the audio unit 50 may expose in order to facilitate interactions with a scene manager 23 that manages playback of one or more visual elements that support an extended reality (XR) scene.
  • XR extended reality
  • the scene manager 23 may represent a unified interface for renderer components to access audio streams associated with an audio element in a so-called scene state.
  • the scene state may reflect a current state of all scene elements (e.g., video elements and/or audio elements), transforms/anchors, and geometry. Other components of the renderer may subscribe to changes in the scene state.
  • All elements in the entire scene are created and the associated metadata is updated to the state that reflects an intended scene configuration at a start of playback.
  • Audio streams are input to the renderer as PCM float samples.
  • the source of an audio stream may for example be decoded MPEG-H audio streams or locally captured audio.
  • the audio elements may not be captured at the same time as the visual elements or may be added later (e.g., during an XR-mediated conference, such as an XR videoconference).
  • the playback system 16 may invoke the scene manager 23 to match one or more visual elements to one or more audio elements (e.g., by comparing a name or other unique identifier (UID) associated with each of the one or more visual elements and one or more audio elements).
  • the scene manager 23 may modify audio metadata defining a pose (which may refer to a position and/or orientation) of the one or more audio elements to more closely correspond to the matching one or more visual elements.
  • the scene manager 23 may then output the modified audio metadata to an audio unit 50 of the audio playback system, which may render the audio elements to one or more speaker feeds 25.
  • the playback system 16 may then output the one or more speaker feeds 25 to one or more speakers (which may also be referred to as loudspeakers, headphone speakers, or more generally as transducers).
  • the scene manager 23 may map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element (both of which may be obtained by parsing the bitstreams 21 during decoding - e.g., by audio decoding device 24 for audio data specified via the bitstreams 21), the at least one visual element to the at least one audio element.
  • Scene manager 23 may identify a unique identifier (UID) and/or name in the visual metadata and audio metadata, comparing the UID and/or name (UID/name) associated with each visual element and each audio element to map one or more of the visual elements to one or more of the audio elements.
  • the scene manager 23 may next modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata.
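  • The mapping and metadata modification performed by the scene manager 23 might be sketched as follows, assuming each element carries "uid" and "pose" fields; the dictionary shape and function name are illustrative, not the interface defined by any standard:

```python
def update_audio_poses(visual_elements, audio_elements):
    # Match audio elements to visual elements by a shared UID/name, then
    # overwrite each matched audio element's pose with the (typically
    # higher-resolution) pose of its visual counterpart. Unmatched audio
    # elements keep their original metadata.
    visual_by_uid = {v["uid"]: v for v in visual_elements}
    modified = []
    for audio in audio_elements:
        visual = visual_by_uid.get(audio["uid"])
        if visual is not None:
            audio = dict(audio, pose=visual["pose"])  # copy with updated pose
        modified.append(audio)
    return modified
```

  The modified metadata would then be handed to the audio unit via its API for rendering to the speaker feeds.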
  • the audio unit 50 may present the API that the scene manager 23 may invoke to provide the modified audio metadata to the audio unit 50 via an API call.
  • the audio unit 50 may render, based on the modified audio metadata, the at least one audio element to the one or more speaker feeds 25, outputting the one or more speaker feeds 25 to one or more transducers (e.g., headphones, loudspeakers, etc.). That is, the audio unit 50 may configure one or more of the audio renderers 22 based on pose information specified by the modified audio metadata.
  • the pose information (which may be denoted as a “pose” associated with the visual element and/or audio element) may define a location and an orientation (where both a location and an orientation are representative of 6DOF location information) of the corresponding element.
  • the pose of the audio element may not tightly correspond to the pose of the video element.
  • the scene manager 23 may modify the pose of the audio element in the audio metadata to more accurately reflect the pose of the visual element in the XR scene.
  • the API of the audio unit 50 may facilitate updating the pose of the audio element in near-real-time to reduce latency and thereby improve immersion of the XR scene (especially when consumed via a wearable such as the XR device 800).
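As a rough illustration of the pose correction described above, the sketch below overwrites the low-resolution pose in the audio metadata with the pose of the mapped visual element (the `Pose` fields and function names are illustrative assumptions, not part of the bitstream syntax):

```python
import math
from dataclasses import dataclass, replace

@dataclass
class Pose:
    # 6DOF pose: a 3D location plus an orientation (yaw/pitch/roll, degrees).
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0

def translational_distance(a: Pose, b: Pose) -> float:
    """Euclidean distance between the location components of two poses."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

def align_audio_pose(audio: Pose, visual: Pose) -> Pose:
    """Overwrite the (low-resolution) audio pose with the visual pose."""
    return replace(audio, x=visual.x, y=visual.y, z=visual.z,
                   yaw=visual.yaw, pitch=visual.pitch, roll=visual.roll)

# The audio element's pose is slightly off from the visual element's pose:
audio_pose = Pose(1.5, 0.0, 0.0)
visual_pose = Pose(1.0, 0.0, 0.0, yaw=90.0)
modified_pose = align_audio_pose(audio_pose, visual_pose)
```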
  • the techniques may improve operation of the playback system 16 as the playback system 16 may more accurately reproduce a soundfield (based on the one or more speaker feeds) to potentially improve the immersive experience of XR systems, such as system 10. That is, rather than render the audio elements based on low resolution audio metadata that may not match the corresponding visual element, the playback system 16 may modify the audio metadata to more closely match the corresponding visual element, thereby increasing the immersion of the XR experience through higher resolution audio metadata. As such, various aspects of the techniques described in this disclosure may improve the playback system 16 itself.
• FIG. 1B is a block diagram illustrating another example system 100 configured to perform various aspects of the techniques described in this disclosure.
• the system 100 is similar to the system 10 shown in FIG. 1A, except that the audio renderers 22 shown in FIG. 1A are replaced with a binaural renderer 102 capable of performing binaural rendering using one or more HRTFs or other functions capable of rendering to left and right speaker feeds 103.
  • the audio playback system 16B may output the left and right speaker feeds 103 to headphones 104, which may represent another example of a wearable device and which may be coupled to additional wearable devices to facilitate reproduction of the soundfield, such as a watch, the VR headset noted above, smart glasses, smart clothing, smart rings, smart bracelets or any other types of smart jewelry (including smart necklaces), and the like.
  • the headphones 104 may couple wirelessly or via wired connection to the additional wearable devices.
• the headphones 104 may couple to the audio playback system 16 via a wired connection (such as a standard 3.5 mm audio jack, a universal serial bus (USB) connection, an optical audio jack, or other forms of wired connection) or wirelessly (such as by way of a Bluetooth™ connection, a wireless network connection, and the like).
  • the headphones 104 may recreate, based on the left and right speaker feeds 103, the soundfield represented by the ambisonic coefficients 11.
  • the headphones 104 may include a left headphone speaker and a right headphone speaker which are powered (or, in other words, driven) by the corresponding left and right speaker feeds 103.
• FIGS. 2-4 are block diagrams illustrating example architectures of the playback system shown in the example of FIGS. 1A and/or 1B for performing various aspects of the audio rendering techniques described in this disclosure.
  • a playback system 216 may represent an example of playback system 16.
• the playback system 216 includes a runtime system 220, a media access function 222, a scene manager 23, and an audio subsystem 50.
• Runtime system 220 may represent a unit configured to support processing of sensor data, viewport rendering, as well as simultaneous localization and mapping (SLAM) processing.
• Runtime system 220 may operate with respect to a graphic language transmission format (glTF™) that specifies visual elements as 3D scenes (which are one example of XR scenes).
• Scene manager 23 may process the glTF™ elements to unpack and use the underlying assets (which is another way to refer to visual elements and/or video elements).
• Scene manager 23 may also process audio bitstreams separate from the bitstream formatted according to glTF™.
• Media access function (MAF) 222 may represent a unit configured to obtain media content, such as visual bitstreams that specify at least one visual element and audio bitstreams that specify audio elements (or, in other words, audio source elements). MAF 222 may enable access to media data (which is another way to refer to the media content) to be communicated through various delivery networks (such as the Internet via wired, wireless, etc., communication networks, including various cellular networks such as fifth generation (5G) cellular networks).
  • 3GPP TR 26.998 may represent one example standard by which to provide immersive XR media codecs/profiles by which to integrate glass-type XR devices into the 5G network.
  • the audio unit 50 may expose an API 51 by which to interface with audio unit 50 to register callbacks 53.
  • the callbacks 53 may represent a function that is passed into another function (e.g., the audio unit 50) as an argument to be executed later (e.g., prior to rendering each frame of the audio bitstream(s)).
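The callback mechanism can be illustrated with a toy model in which a registered function is executed before each frame is rendered, letting the scene manager refresh the audio metadata (class and method names here are assumed for illustration and do not reproduce the actual API 51):

```python
class AudioUnit:
    """Toy stand-in for an audio unit exposing a callback-registration API."""

    def __init__(self):
        self._callbacks = []
        self.metadata = {}

    def register_callback(self, fn):
        # The function is stored now and executed later, prior to each frame.
        self._callbacks.append(fn)

    def render_frame(self, frame_index):
        # Invoke registered callbacks first so the metadata is up to date.
        for fn in self._callbacks:
            self.metadata.update(fn())
        return (frame_index, dict(self.metadata))

unit = AudioUnit()
listener = {"pose": (0.0, 0.0, 0.0)}
# The scene manager registers a callback that supplies the latest pose:
unit.register_callback(lambda: {"listener_pose": listener["pose"]})
listener["pose"] = (1.0, 0.0, 0.0)   # listener moved between frames
frame, meta = unit.render_frame(0)   # callback runs before rendering
```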
  • the scene manager 23 may map, based on visual metadata associated with the at least one visual element 223 and audio metadata associated with the at least one audio element 225, the at least one visual element 223 to the at least one audio element 225.
  • the visual metadata may include a pose of the at least one visual element 223 in the XR scene
  • the audio metadata may include a pose of the at least one audio element in the XR scene.
• the scene manager 23 may then modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain modified audio metadata. That is, the scene manager 23 may modify, based on the pose of the at least one visual element 223, the position of the at least one audio element 225 to obtain a modified position of the at least one audio element 225 in the XR scene.
  • the modified pose of the at least one audio element differs from the pose of the at least one audio element.
  • the modified pose of the at least one audio element 225 differs from the pose of the at least one audio element 225 in terms of a rotational angle, an elevation angle, and/or a translational distance.
  • the audio element 225 may define a primitive and/or audio meshes, which may be a low resolution (or, stated differently, low-detail) version of the corresponding video element 223.
  • scene manager 23 may perform in-band and/or out-of-band mapping.
• scene manager 23 may analyze the audio metadata (defined in the audio bitstream) associated with the audio element 225 to identify an association between the audio element 225 and the video element 223 as defined by the node name in the glTF™.
  • an extension to the glTF node may define an identifier plus an alignment transform to map the visual node (representative of video element 223) to the corresponding audio element 225.
• the scene manager 23 may next register a callback 53 with the audio unit 50 that includes a transformation for modifying the audio metadata to obtain the modified audio metadata.
  • the audio unit 50 may interface with the media access function 222 to obtain the at least one audio element 225, whereupon the audio unit 50 may render, based on the modified audio metadata (as represented by the callback 53), the at least one audio element to one or more speaker feeds 25/103.
  • the audio unit 50 may then output the one or more speaker feeds 25/103.
  • the scene manager 23 may, when mapping the visual elements 223 to the audio elements 225, determine that none of the at least one visual element 223 maps to the audio elements 225 (e.g., the audio element 225 is nondiegetic, meaning that the audio element 225 is not heard by the characters in the XR scene - such as transition or background music heard by viewers but not by the characters in the XR scene). In this instance, the scene manager 23 may set the unmatched audio element 225 to a general world coordinate system (e.g., (0, 0, 0) in an (X, Y, Z) coordinate system). The scene manager 23 may then register a callback 53 via API 51 of the audio unit 50 to maintain the general world position of the corresponding audio element 225.
  • the audio unit 50 may render each frame of the audio element 225 based on this callback 53 setting the audio metadata for the audio element 225 to the general world coordinate system.
  • the audio unit 50 may render the audio element 225 as background audio to each of the speaker feeds 25/103.
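A minimal sketch of the nondiegetic handling described above, in which any audio element with no mapped visual element is pinned to the general world origin (the function and record names are illustrative assumptions):

```python
# Illustrative sketch: diegetic audio elements follow the pose of their
# mapped visual element; unmatched (nondiegetic) elements are set to the
# general world coordinate system, here (0, 0, 0).
WORLD_ORIGIN = (0.0, 0.0, 0.0)

def position_audio_elements(mapping, audio_ids, visual_poses):
    """mapping: audio_id -> visual_id for diegetic elements only."""
    positions = {}
    for audio_id in audio_ids:
        visual_id = mapping.get(audio_id)
        if visual_id is None:
            # Nondiegetic (e.g., background music): pin to world origin.
            positions[audio_id] = WORLD_ORIGIN
        else:
            positions[audio_id] = visual_poses[visual_id]
    return positions

positions = position_audio_elements(
    mapping={"door_creak": "door"},
    audio_ids=["door_creak", "background_music"],
    visual_poses={"door": (4.0, 0.0, 2.0)})
```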
  • the callback 53 may define a translation and effectively represents the modified audio metadata.
  • the audio unit 50 may request, responsive to the callback 53 and prior to rendering the at least one audio element 225, the modified audio metadata.
  • the audio unit 50 is configured to request, responsive to the callback and prior to rendering each frame of audio data for the at least one audio element 225, the modified audio metadata.
• the relatively simple 3DOF model from 3GPP technical specification (TS) 26.118 with a center point and the shared pose information is extended to operate in a reference space that facilitates the above mapping between visual elements and audio elements to support 6DOF in XR systems.
  • the pose may no longer just be the head rotation but may also include the position of the user/camera/audio listener in the 6DOF XR space.
  • the scene itself may be more complex, including interactions, multiple elements, etc.
• the audio unit 50 may provide for proper rendering of an immersive experience using significantly more information than just the time synchronization information and pose as in 3DOF.
• For the audio listener, the following may be defined: 1) a type of audio listener, 2) a pose of the audio listener (which may be obtained via the XR device 800 and/or other sensors), and 3) an alignment of the head related transfer function (HRTF) with the underlying avatar representing the listener in the XR scene.
  • For each audio source the following needs to be defined: 1) a type of audio source, 2) a corresponding visual element, 3) a pose with respect to a global coordinate system, and 4) timing information (which may be used to synchronize the audio element 225 with the corresponding video element 223 and a start/stop time based on a schedule or interactivity).
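The listener and source definitions above can be collected into simple records, shown here as a sketch (field names are assumptions chosen to mirror the enumerated items, not normative syntax):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed 6DOF pose encoding: (x, y, z, yaw, pitch, roll).
Pose6DOF = Tuple[float, float, float, float, float, float]

@dataclass
class AudioListener:
    listener_type: str   # 1) type of audio listener
    pose: Pose6DOF       # 2) pose, e.g., from the XR device's sensors
    hrtf_profile: str    # 3) HRTF alignment with the listener's avatar

@dataclass
class AudioSource:
    source_type: str               # 1) type of audio source
    visual_element: Optional[str]  # 2) corresponding visual element, if any
    pose: Pose6DOF                 # 3) pose w.r.t. the global coordinate system
    start_time: float              # 4) timing info for synchronization/schedule
    stop_time: Optional[float] = None

listener = AudioListener("binaural", (0.0,) * 6, "default_hrtf")
source = AudioSource("object", "door",
                     (4.0, 0.0, 2.0, 0.0, 0.0, 0.0), start_time=0.0)
```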
  • the pose is defined relative to a reference space defined/selected by scene manager 23.
• XR anchoring and interactivity apply to all media types, where anchoring may associate a given video/audio element to a set location in the XR world, the real-world, etc.
  • the audio unit 50 may act as a so-called “black box” that exchanges information with the scene manager 23 via one or more APIs 51.
  • the audio unit 50 may operate in a 6DOF/XR audio-visual system 10.
  • the 6DOF/XR experience is described by a scene that may include one or more elements that have assigned both audio and visual properties.
  • the 6DOF XR audio-visual experience may include the ability to freely move at least in a restricted place.
  • the XR scene may allow for modification, updates, and interactions with the scene and/or elements in the scene such that the audio or visual properties may change.
  • Scene manager 23 (which may also be referred to as a presentation engine 23) may utilize hooks that modify or interject spatial audio metadata information, which is then used by the audio unit 50 to decode and render the 6DOF audio in synchronization with the visual experience (provided by way of the visual elements).
• the so-called “hook” is realized in the form of callback 53, which may be periodic or on-demand (e.g., as required by the audio unit 50 and/or scene manager 23).
  • various aspects of the techniques may enable steering of a 6DOF audio unit. Furthermore, various aspects of the techniques may ensure alignment of the visual scene with the audio scene, driven by a single source while keeping the audio rendering process separate. MPEG-I Audio systems that conform to MPEG-I Audio 23090-4 may perform these techniques.
  • a playback system 236 may represent another example of the audio playback system 16.
  • the playback system 236 may be similar to the playback system 216 except that MAF 222 includes an audio processing system 55, which may also be referred to as an audio preprocessing system 55.
  • the audio preprocessing system 55 may expose the API 51 that scene manager 23 may invoke to interface and configure the audio preprocessing system 55 to modify the audio metadata associated with the audio element 225.
  • Scene manager 23 may configure the audio preprocessing system 55 to modify, based on the mapping of the at least one visual element 223 to the at least one audio element 225, the audio metadata to obtain the modified audio metadata.
  • the audio processing system 55 may insert scene update messages or rewrite existing messages to define the updated audio metadata associated with the audio element 225. In this way, the audio processing system 55 may update a listener pose and/or scene geometry associated with audio scene elements 225 (which is another way to refer to audio elements 225).
• the audio processing system 55 may replace, based on the configuration (defined via the API 51), the audio metadata in the audio bitstream (defining audio element 225) with the modified audio metadata.
• the audio processing system 55 may output the audio bitstream (which may be referred to as a modified 6DOF audio bitstream) to the audio unit 50.
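The pre-processing step described above may be sketched as a rewrite-or-insert pass over parsed bitstream messages (the message dictionaries and field names are illustrative assumptions, not the MPEG-I bitstream syntax):

```python
# Illustrative sketch of the pre-processing pass: rewrite existing
# scene-update messages with modified poses, and insert new scene-update
# messages for nodes that have no existing message in the bitstream.

def preprocess_bitstream(messages, pose_updates):
    """messages: parsed bitstream messages; pose_updates: node -> new pose."""
    pending = dict(pose_updates)
    out = []
    for msg in messages:
        if msg.get("type") == "scene_update" and msg.get("node") in pending:
            msg = {**msg, "pose": pending.pop(msg["node"])}  # rewrite in place
        out.append(msg)
    for node, pose in pending.items():  # insert updates with no existing msg
        out.append({"type": "scene_update", "node": node, "pose": pose})
    return out

modified = preprocess_bitstream(
    [{"type": "scene_update", "node": "door", "pose": (0, 0, 0)},
     {"type": "audio_frame", "node": None}],
    {"door": (4, 0, 2), "lamp": (1, 1, 0)})
```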
  • a playback system 266 may represent another example of the audio playback system 16.
• the playback system 266 may be similar to the playback system 216 and/or 236 except that the scene manager 23 may construct, based on the mapping of the at least one visual element 223 to the at least one audio element 225, a scene graph that includes a parent node representative of the at least one visual element 223, and a child node that depends from the parent node and that represents the at least one audio element 225.
  • the scene graph may be similar to the scene graphs that are defined via various gaming platforms, such as the Unity development platform.
  • Scene manager 23 may modify, based on the scene graph, the audio metadata to obtain the modified audio metadata.
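The parent/child dependency can be sketched as follows, where the audio node inherits its world pose from the visual parent (translation-only for brevity; the class and method names are illustrative assumptions):

```python
class Node:
    """Toy scene-graph node with a translation-only local pose."""

    def __init__(self, name, local_pose=(0.0, 0.0, 0.0)):
        self.name = name
        self.local_pose = local_pose
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def world_pose(self, parent_world=(0.0, 0.0, 0.0)):
        # World pose = parent's world pose plus this node's local offset.
        return tuple(p + l for p, l in zip(parent_world, self.local_pose))

visual = Node("door_visual", (4.0, 0.0, 2.0))  # parent: visual element 223
audio = visual.add(Node("door_audio"))         # child: audio element 225
# Moving the parent implicitly moves the dependent audio element:
visual.local_pose = (5.0, 0.0, 2.0)
audio_world = audio.world_pose(visual.world_pose())  # (5.0, 0.0, 2.0)
```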
  • the scene manager 23 is further configured to output the modified audio metadata to the audio unit 50.
  • the scene manager 23 is further configured to output, via API 51, exposed by the audio unit 50, the modified audio metadata to the audio unit 50.
  • the scene manager 23 may reduce latency associated with configuring the audio processing system 55 and/or configuring the callbacks 53.
  • the scene manager 23 may maintain a single scene and update the single scene instead of relying on separate sets of nodes for visual and audio elements.
  • the scene manager 23 may invoke the API 51 in order to provide the 6DOF audio metadata to the audio unit 50.
• the scene manager 23 may provide inheritance of scene information (e.g., the glTF™ visual element graph) from the visual element 223 to the audio elements 225 that depend from the visual element 223 in the scene graph.
  • An example scene description may be as follows:
• Metadata: For the MPEG-I Audio Bitstream, the following metadata may be defined:
  • the inherited audio metadata for the audio element 225 may be defined as follows:
• For interactivity, the runtime system 220 captures user actions that modify an audio-visual element position.
• the playback system 216 may require updating a visual node in the scene manager 23 and an audio node separately via a callback 53 to the audio unit 50.
  • the audio pre-processing system 55 may insert the updated position into the 6DoF audio bitstream.
  • the scene manager 23 interfaces with the audio unit 50 with the modified audio metadata used for rendering.
  • the audio playback system 216 may not require a callback 53 to render nondiegetic audio.
  • the playback system 236 may bypass pre-processing of nondiegetic audio elements.
  • FIG. 6 illustrates an example of a wireless communications system 100 that supports audio streaming in accordance with aspects of the present disclosure.
  • the wireless communications system 100 includes base stations 105, UEs 115, and a core network 130.
  • the wireless communications system 100 may be a Long Term Evolution (LTE) network, an LTE- Advanced (LTE-A) network, an LTE-A Pro network, or a New Radio (NR) network.
  • wireless communications system 100 may support enhanced broadband communications, ultra-reliable (e.g., mission critical) communications, low latency communications, or communications with low-cost and low-complexity devices.
  • Base stations 105 may wirelessly communicate with UEs 115 via one or more base station antennas.
• Base stations 105 described herein may include or may be referred to by those skilled in the art as a base transceiver station, a radio base station, an access point, a radio transceiver, a NodeB, an eNodeB (eNB), a next-generation NodeB or giga-NodeB (either of which may be referred to as a gNB), a Home NodeB, a Home eNodeB, or some other suitable terminology.
  • Wireless communications system 100 may include base stations 105 of different types (e.g., macro or small cell base stations).
• the UEs 115 described herein may be able to communicate with various types of base stations 105 and network equipment including macro eNBs, small cell eNBs, gNBs, relay base stations, and the like.
• Each base station 105 may be associated with a particular geographic coverage area 110 in which communications with various UEs 115 are supported. Each base station 105 may provide communication coverage for a respective geographic coverage area 110 via communication links 125, and communication links 125 between a base station 105 and a UE 115 may utilize one or more carriers. Communication links 125 shown in wireless communications system 100 may include uplink transmissions from a UE 115 to a base station 105, or downlink transmissions from a base station 105 to a UE 115. Downlink transmissions may also be called forward link transmissions while uplink transmissions may also be called reverse link transmissions.
  • the geographic coverage area 110 for a base station 105 may be divided into sectors making up a portion of the geographic coverage area 110, and each sector may be associated with a cell.
  • each base station 105 may provide communication coverage for a macro cell, a small cell, a hot spot, or other types of cells, or various combinations thereof.
  • a base station 105 may be movable and therefore provide communication coverage for a moving geographic coverage area 110.
  • different geographic coverage areas 110 associated with different technologies may overlap, and overlapping geographic coverage areas 110 associated with different technologies may be supported by the same base station 105 or by different base stations 105.
  • the wireless communications system 100 may include, for example, a heterogeneous LTE/LTE-A/LTE-A Pro or NR network in which different types of base stations 105 provide coverage for various geographic coverage areas 110.
• UEs 115 may be dispersed throughout the wireless communications system 100, and each UE 115 may be stationary or mobile.
  • a UE 115 may also be referred to as a mobile device, a wireless device, a remote device, a handheld device, or a subscriber device, or some other suitable terminology, where the “device” may also be referred to as a unit, a station, a terminal, or a client.
  • a UE 115 may also be a personal electronic device such as a cellular phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, or a personal computer.
• a UE 115 may be any of the audio sources described in this disclosure, including a VR headset, an XR headset, an AR headset, a vehicle, a smartphone, a microphone, an array of microphones, or any other device that includes a microphone or is able to transmit a captured and/or synthesized audio stream.
• a synthesized audio stream may be an audio stream that was stored in memory or was previously created or synthesized.
• a UE 115 may also refer to a wireless local loop (WLL) station, an Internet of Things (IoT) device, an Internet of Everything (IoE) device, or an MTC device, or the like, which may be implemented in various articles such as appliances, vehicles, meters, or the like.
• Some UEs 115 may be low cost or low complexity devices, and may provide for automated communication between machines (e.g., via Machine-to-Machine (M2M) communication).
  • M2M communication or MTC may refer to data communication technologies that allow devices to communicate with one another or a base station 105 without human intervention.
  • M2M communication or MTC may include communications from devices that exchange and/or use audio metadata indicating privacy restrictions and/or password-based privacy data to toggle, mask, and/or null various audio streams and/or audio sources as will be described in more detail below.
• a UE 115 may also be able to communicate directly with other UEs 115 (e.g., using a peer-to-peer (P2P) or device-to-device (D2D) protocol).
• One or more of a group of UEs 115 utilizing D2D communications may be within the geographic coverage area 110 of a base station 105.
• Other UEs 115 in such a group may be outside the geographic coverage area 110 of a base station 105, or be otherwise unable to receive transmissions from a base station 105.
• groups of UEs 115 communicating via D2D communications may utilize a one-to-many (1:M) system in which each UE 115 transmits to every other UE 115 in the group.
• a base station 105 facilitates the scheduling of resources for D2D communications.
• D2D communications are carried out between UEs 115 without the involvement of a base station 105.
  • Base stations 105 may communicate with the core network 130 and with one another. For example, base stations 105 may interface with the core network 130 through backhaul links 132 (e.g., via an SI, N2, N3, or other interface). Base stations 105 may communicate with one another over backhaul links 134 (e.g., via an X2, Xn, or other interface) either directly (e.g., directly between base stations 105) or indirectly (e.g., via core network 130).
  • wireless communications system 100 may utilize both licensed and unlicensed radio frequency spectrum bands.
  • wireless communications system 100 may employ License Assisted Access (LAA), LTE-Unlicensed (LTE-U) radio access technology, or NR technology in an unlicensed band such as the 5 GHz ISM band.
• wireless devices such as base stations 105 and UEs 115 may employ listen-before-talk (LBT) procedures to ensure a frequency channel is clear before transmitting data.
  • operations in unlicensed bands may be based on a carrier aggregation configuration in conjunction with component carriers operating in a licensed band (e.g., LAA).
  • Operations in unlicensed spectrum may include downlink transmissions, uplink transmissions, peer-to-peer transmissions, or a combination of these.
  • Duplexing in unlicensed spectrum may be based on frequency division duplexing (FDD), time division duplexing (TDD), or a combination of both.
• FIG. 7 is a block diagram illustrating an example architecture of the playback system shown in the example of FIGS. 1A and/or 1B for performing various aspects of the audio rendering techniques described in this disclosure.
  • a playback system 716 may represent another example of the playback system 216 shown in the example of FIG. 2.
• the playback system 716 may include a runtime system 720 (which is an example of the runtime system 220) that conforms to the OpenXR™ specification (or, in other words, the OpenXR™ standard).
• the playback system 716 may also include an audio unit 750 (which represents an example of the audio unit 50) that may conform to the MPEG-I audio standard.
  • Audio unit 750 may expose an API 751, which may represent an example of the API 51.
  • the API 751 may also conform to the MPEG-I audio standard, which includes the functions defined in the table shown in the example of FIG. 11 that accepts the specified inputs (and may adhere to the description listed in the table of FIG. 11).
  • the API 751 may allow scene manager 23 to invoke an init function, a configure function, a start function, a pause function, a resume function, a stop function, an update function, an updateGraph function, and a registerCallback function.
• the init function may accept an MPEG-I audio bitstream buffer/uniform resource locator (URL) or a description of the immersive audio scene, which allows scene manager 23 to initialize the MPEG-I audio renderer (represented by the audio unit 750 in the example of FIG. 7) by providing an MPEG-I audio bitstream URL or buffer pointer (or alternatively, the audio unit 750 may be initialized by providing a description of the spatial audio scene, extracted from a scene description).
• the configure function may accept as inputs a time clock, a node mapping(s), a bounding box and coordinate system alignment, and/or XR spaces and AR anchors, which the scene manager 23 may invoke to specify an initial configuration of audio unit 750 with the potential goal to synchronize the audio scene to the visual scene by establishing a common timeline, exchanging node mappings, aligning the coordinate systems, and/or defining the XR spaces and anchors.
• listener space descriptor file (LSDF) anchors may be aligned to MPEG anchors in glTF.
  • the start, pause, resume, and/or stop functions may accept, as inputs, an audio source identifier (ID), and/or an action time, which allows the scene manager 23 to control the playback of specific audio sources for interactivity purposes.
  • the update function may accept, as inputs, an array of one or more of node identifier (ID), a translation, rotation, and scaling (TRS) matrix, and/or a timestamp.
• the scene manager 23 may invoke the update function to update node positions and orientations in the audio scene, providing the TRS matrix, which is relative to the initial pose at configuration time and is not incremental, where this may be a sequence of (TRS matrix, timestamp) pairs to possibly support animation.
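The non-incremental TRS semantics can be illustrated as follows: each update carries a matrix relative to the initial pose at configuration time, so the renderer applies only the most recent matrix rather than composing successive updates (a translation-only sketch; names are illustrative assumptions):

```python
def translation(tx, ty, tz):
    """Build a 4x4 row-major translation matrix (TRS with identity R and S)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def apply_matrix(m, point):
    """Apply a 4x4 row-major transform to a 3D point."""
    x, y, z = point
    v = (x, y, z, 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(3))

initial_pose = (1.0, 0.0, 0.0)  # node pose at configuration time
# Two successive update() calls; each TRS is relative to the INITIAL pose,
# not to the result of the previous update:
update_1 = translation(2.0, 0.0, 0.0)
update_2 = translation(3.0, 0.0, 0.0)
# The renderer applies only the most recent matrix to the initial pose:
current = apply_matrix(update_2, initial_pose)  # (4.0, 0.0, 0.0)
```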
  • the updateGraph function may accept, as inputs, an add node (specified by way of one or more of a node identifier, a parent node identifier, and/or one or more properties), a remove node (specified by way of the node identifier), and/or an update node (specified by way of one or more of the node identifier and/or one or more properties).
• the scene manager 23 may invoke the updateGraph function to add or remove a set of audio nodes to the internal representation of the audio scene graph in the audio renderer (of the audio unit 750).
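A toy sketch of the updateGraph semantics, in which removing a node (e.g., a user leaving a shared space) also removes the nodes parented under it (the class and method names are illustrative assumptions, not the actual API surface):

```python
class AudioSceneGraph:
    """Toy internal scene-graph representation (structure assumed)."""

    def __init__(self):
        self.nodes = {}  # node_id -> {"parent": parent_id, "props": {...}}

    def add_node(self, node_id, parent_id=None, **props):
        self.nodes[node_id] = {"parent": parent_id, "props": props}

    def remove_node(self, node_id):
        # Remove the node plus everything parented under it, transitively.
        doomed = {node_id}
        grew = True
        while grew:
            grew = False
            for nid, node in self.nodes.items():
                if node["parent"] in doomed and nid not in doomed:
                    doomed.add(nid)
                    grew = True
        for nid in doomed:
            self.nodes.pop(nid, None)

    def update_node(self, node_id, **props):
        self.nodes[node_id]["props"].update(props)

graph = AudioSceneGraph()
graph.add_node("user2")                                   # user joins
graph.add_node("avatar", parent_id="user2")
graph.add_node("audio_source", parent_id="avatar", gain=1.0)
graph.update_node("audio_source", gain=0.5)               # property update
graph.remove_node("user2")                                # user leaves
```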
• the registerCallback function may accept, as inputs, a callback function and/or one or more events (e.g., NEED_LISTENER_POSE).
• the scene manager 23 may invoke the registerCallback function to provide hooks for the audio renderer (of the audio unit 750) to invoke when a certain event is detected, e.g., when the audio renderer needs an updated listener pose.
• For node mapping, an implicit mapping mechanism may be assumed, and it is the responsibility of the author to ensure proper and consistent node naming.
  • the node name property is used for the mapping, where mapping may only be applied at the node level.
• a glTF node and an audio scene node are mapped together, which may mean that nodes are considered to be at the same hierarchical level (and not parent-child). For all nodes in the mapping (as one example), any changes to the nodes should trigger an update call (referencing that the scene manager 23 may invoke the update function in this example).
• the mapping may convey a transformation that is applied to the audio node to align both nodes, where this transformation may be provided as new in-band scene metadata in the MPEG-I audio stream (possibly as the TRS matrix referenced above). If no transform is provided, it is assumed that the audio transform and corresponding visual node are aligned or, alternatively, a transform is derived from the initial transforms at alignment time, e.g., TRS_visual⁻¹ multiplied by TRS_audio.
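The derived alignment transform can be sketched for the pure-translation case, where the alignment is the inverse of the visual TRS multiplied by the audio TRS (a simplified illustration; real TRS matrices would also carry rotation and scale):

```python
def translation(tx, ty, tz):
    """Build a 4x4 row-major translation matrix."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_translation(m):
    # A pure translation is inverted by negating its offset column.
    return translation(-m[0][3], -m[1][3], -m[2][3])

trs_visual = translation(5.0, 0.0, 0.0)  # visual node's initial transform
trs_audio = translation(7.0, 0.0, 0.0)   # audio node's initial transform
# Derived alignment = inverse(TRS_visual) * TRS_audio:
alignment = matmul(invert_translation(trs_visual), trs_audio)
# alignment carries the residual offset (2, 0, 0) between the two nodes
```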
  • Other possible functions exposed by the API 751 may include functions related to an audio listener space and/or an XR space definition.
  • the scene manager 23 may obtain an understanding of the scene (e.g., a 3D reconstruction of the physical environment of the user) by interacting with the runtime system 720, where the scene manager 23 uses the update mechanism (or in other words, function) to update the LSDF.
• the audio unit 750 may pass information about trackables that the audio renderer may track in the AR physical space, while the scene manager 23 may instruct the runtime system 720 to create a new application XR space and track the new application XR space, and retrieve or otherwise obtain the initial pose and share the initial pose with the audio renderer (of the audio unit 750).
• the audio renderer may retrieve the actual pose of the trackable in that XR space by using the callback function referenced above.
  • nodes can be added to and removed from a graph, while one or more properties and components of a node may be updated.
  • a node is added/removed based on events in the app, where adding nodes in some examples may refer to a user inserting a 3D asset into the scene (e.g., a new user joins the shared space) and removing nodes in some examples may refer to a user leaving a shared space.
  • Node updates may affect the components of the node, where, for example, such updates may change a material of a node and/or change a geometry of the node.
• an audio playback system 816 may represent an example of the audio playback system 716 shown in the example of FIG. 7, showing how the scene manager 23 may invoke graph update functions via an API 851 (which may represent an example of the API 751) exposed by an audio renderer 850 (which may represent an example of the audio unit 750).
  • the scene manager 23 may invoke the update function noted above to remove nodes 872 (specifying a User #1 node, a Whiteboard node, an Avatar node, and an Audio source node) from the scene graph 870.
  • the scene manager 23 may invoke the update function to add nodes 874 (specifying a User #2 node, a Video Display node, an Avatar node, and an Audio source node) representative of User #2 to the scene graph 870, where the scene manager 23 may invoke the update function to update a subset of nodes 874 (denoted as nodes 876 specifying an Avatar node and an Audio source node) in the scene graph 870.
• the scene manager 23 may receive constructed scene information from the runtime system 720, where the scene manager 23 may then extract acoustic relevant information and create an LSDF. If an LSDF is already available at the audio renderer 850, a mapping of identified objects is performed to align the LSDF to the reconstructed physical space (which may match bounding boxes by rotating, translating, and scaling the LSDF acoustic environment geometry according to the TRS matrix). Physical anchors are mapped and/or aligned with trackables, where the anchors may for instance be the user’s floor, a QR code, or a 2D image that is tracked by the runtime system 720.
  • the scene graph may only include audio elements or only include video elements, where various relationships between other audio elements or other video elements may be provided by way of the scene graph.
  • a separate visual scene graph may be mapped to a separate audio scene graph to identify a common scene graph, such as that described above with respect to FIG. 8.
  • an audio only scene graph may be used to obtain the modified audio metadata (which, as noted below, may represent a single example of audio descriptive information).
  • all reference to audio metadata herein may also be referred to as audio descriptive information.
  • FIG. 9 is a diagram illustrating example listener space descriptor file (LSDF) alignment according to various aspects of the techniques described in this disclosure.
  • the runtime system 720 may obtain a representation of the scene, where LSDF and LSDF updates are aligned to the audio coordinate system. Alignment may include matching trackables (e.g., QR code, image, floor, etc.) and/or scaling, rotation, and/or translation. LSDF is then used for audio rendering.
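As a rough illustration of the TRS-based alignment, the following sketch applies a scale, a rotation about the vertical axis, and a translation to a single LSDF geometry point. The `trs` dictionary keys are assumptions made for illustration, not a defined LSDF structure:

```python
import math

def align_point(p, trs):
    """Apply scale, then a rotation about the z axis, then a translation to a
    3D point, mirroring how LSDF acoustic geometry might be aligned to the
    reconstructed physical space; 'trs' keys are hypothetical."""
    x, y, z = (c * trs["scale"] for c in p)
    a = math.radians(trs["yaw_deg"])
    # Rotate in the horizontal plane (z is treated as up here for simplicity).
    xr = x * math.cos(a) - y * math.sin(a)
    yr = x * math.sin(a) + y * math.cos(a)
    tx, ty, tz = trs["translate"]
    return (xr + tx, yr + ty, z + tz)

# Align an LSDF bounding-box corner to a trackable (e.g., a QR-code anchor).
trs = {"scale": 2.0, "yaw_deg": 90.0, "translate": (1.0, 0.0, 0.0)}
aligned = align_point((1.0, 0.0, 0.0), trs)
```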
  • audio metadata may be specified in an audio bitstream representative of the audio element
  • audio descriptive information may be specified in various other side information, including via different transport formats associated with cellular communication standards, such as 3GPP 3G, 3GPP 4G, 3GPP 5G, 3GPP 6G, etc., various wireless networking standards, including personal area network standards (such as Bluetooth™), the IEEE 802.11 family of standards, and the like, and MPEG standards related to audio (e.g., MPEG-1, MPEG-2, MPEG-4, etc.).
  • audio descriptive information may be associated with the audio element but not necessarily transmitted in the same audio bitstream as the audio element, but instead as side information or other transport mechanisms.
  • FIG. 10 is a flowchart illustrating example operation of the content consumer device of FIG. 1 in performing various aspects of the techniques described in this disclosure.
  • the content consumer device 14 shown in the example of FIG. 1 may obtain a bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element (1000).
  • the content consumer device 14 may also construct, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element (1002). The content consumer device 14 may further modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information (1004). The content consumer device 14 may next render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds (1006), and output the one or more speaker feeds (1008). [0152] It is to be recognized that depending on the example, certain acts or events of any of the techniques described herein can be performed in a different sequence, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the techniques). Moreover, in certain examples, acts or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially.
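The flowchart steps (1000)-(1008) can be sketched end to end as follows. Every function body here is a hypothetical stand-in; the real bitstream decoding, scene-graph construction, and rendering are far more involved:

```python
# Hypothetical sketch of flowchart steps (1000)-(1008); not the actual decoder.
def obtain(bitstream):
    # (1000) obtain audio elements and the audio descriptive information
    return bitstream["elements"], dict(bitstream["info"])

def construct_scene_graph(elements):
    # (1002) one node per audio element
    return {e["id"]: e for e in elements}

def modify(graph, info):
    # (1004) e.g., override each element's position from its scene-graph node
    return {eid: {**info.get(eid, {}), "position": node["position"]}
            for eid, node in graph.items()}

def render(elements, info):
    # (1006) stand-in renderer producing one feed label per element
    return [f"feed:{e['id']}@{info[e['id']]['position']}" for e in elements]

def process(bitstream):
    elements, info = obtain(bitstream)       # (1000)
    graph = construct_scene_graph(elements)  # (1002)
    modified = modify(graph, info)           # (1004)
    feeds = render(elements, modified)       # (1006)
    return feeds                             # (1008) output the speaker feeds

bitstream = {
    "elements": [{"id": "src1", "position": (0.0, 1.0, 0.0)}],
    "info": {"src1": {"gain": 0.5}},
}
feeds = process(bitstream)
```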
  • the VR device (or the streaming device) may exchange messages with an external device using a network interface coupled to a memory of the VR/streaming device, where the exchange messages are associated with the multiple available representations of the soundfield.
  • the VR device may receive, using an antenna coupled to the network interface, wireless signals including data packets, audio packets, visual packets, or transport protocol data associated with the multiple available representations of the soundfield.
  • one or more microphone arrays may capture the soundfield.
  • the multiple available representations of the soundfield stored to the memory device may include a plurality of object-based representations of the soundfield, higher order ambisonic representations of the soundfield, mixed order ambisonic representations of the soundfield, a combination of object-based representations of the soundfield with higher order ambisonic representations of the soundfield, a combination of object-based representations of the soundfield with mixed order ambisonic representations of the soundfield, or a combination of mixed order representations of the soundfield with higher order ambisonic representations of the soundfield.
  • one or more of the soundfield representations of the multiple available representations of the soundfield may include at least one high-resolution region and at least one lower-resolution region, and wherein the selected representation based on the steering angle provides a greater spatial precision with respect to the at least one high-resolution region and a lesser spatial precision with respect to the lower-resolution region.
  • the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit.
  • Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
  • computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
  • a computer program product may include a computer-readable medium.
  • such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.
  • coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media.
  • Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • processors including fixed function processing circuitry and/or programmable processing circuitry, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • processors may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
  • the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
  • the techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set).
  • Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
  • Example 1A A device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene; and processing circuitry coupled to the memory and configured to: modify, based on a mapping of the at least one visual element to the at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • Example 1.5A The device of example 1A, wherein the processing circuitry is further configured to map, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 2A The device of example 1A, wherein the visual metadata includes a pose of the at least one visual element in the extended reality scene, wherein the audio metadata includes a pose of the at least one audio element in the extended reality scene, and wherein the processing circuitry is configured to modify, based on the pose of the at least one visual element, the pose of the at least one audio element to obtain a modified pose of the at least one audio element in the extended reality scene.
  • Example 3A The device of example 2A, wherein the modified pose of the at least one audio element differs from the pose of the at least one audio element.
  • Example 4A The device of any combination of examples 2A and 3A, wherein the modified pose of the at least one audio element differs from the pose of the at least one audio element in terms of a rotational angle.
  • Example 5A The device of any combination of examples 2A-4A, wherein the modified pose of the at least one audio element differs from the pose of the at least one audio element in terms of a translational distance.
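Examples 2A-5A describe deriving a modified audio pose from the mapped visual element's pose, optionally differing by a translational distance and a rotational angle. A minimal sketch, with hypothetical pose fields:

```python
# Illustrative pose override; the pose dictionary layout is an assumption.
def modify_audio_pose(visual_pose, offset=(0.0, 0.0, 0.0), yaw_offset_deg=0.0):
    """Return a modified audio pose anchored at the visual element's pose,
    optionally differing by a translational offset and a rotational angle."""
    vx, vy, vz = visual_pose["position"]
    ox, oy, oz = offset
    return {
        "position": (vx + ox, vy + oy, vz + oz),
        "yaw_deg": visual_pose["yaw_deg"] + yaw_offset_deg,
    }

# Place the audio source 0.5 units above the visual element (e.g., at an
# avatar's mouth) and rotate it 15 degrees relative to the visual pose.
visual = {"position": (2.0, 0.0, 1.0), "yaw_deg": 90.0}
modified = modify_audio_pose(visual, offset=(0.0, 0.5, 0.0), yaw_offset_deg=-15.0)
```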
  • Example 6A The device of any combination of examples 1A-5A, wherein the at least one audio element includes a first audio element and a second audio element, wherein the processing circuitry is configured to map, based on visual metadata associated with the at least one visual element and audio metadata associated with the first audio element, the at least one visual element to the first audio element, wherein the processing circuitry is further configured to: determine that none of the at least one visual element maps to the second audio element; and render, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 7A The device of any combination of examples 1A-6A, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the processing circuitry is configured to map, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 8A The device of example 7A, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
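The identifier-based mapping of examples 6A-8A might look like the following sketch, where matching identifiers pair a visual element with an audio element and audio elements without a match keep their original metadata; all field names are illustrative:

```python
# Sketch of unique-identifier mapping between visual and audio elements.
def map_elements(visual_meta, audio_meta):
    """Pair audio elements with visual elements by unique identifier; audio
    elements with no visual counterpart are left to render unmodified."""
    visual_by_id = {v["id"]: v for v in visual_meta}
    mapping, unmapped = {}, []
    for a in audio_meta:
        if a["id"] in visual_by_id:
            mapping[a["id"]] = visual_by_id[a["id"]]
        else:
            unmapped.append(a["id"])  # rendered with original metadata (6A)
    return mapping, unmapped

visual_meta = [{"id": "avatar_1", "name": "Avatar"}]
audio_meta = [
    {"id": "avatar_1", "name": "Voice"},
    {"id": "ambience", "name": "Room"},  # no visual counterpart
]
mapping, unmapped = map_elements(visual_meta, audio_meta)
```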
  • Example 9A The device of any combination of examples 1A-8A, further comprising one or more speakers configured to reproduce, based on the one or more speaker feeds, a soundfield.
  • Example 10A The device of any combination of examples 1A-9A, wherein the processing circuitry is further configured to execute a scene manager and an audio unit, wherein the scene manager is configured to map, based on the visual metadata associated with the at least one visual element and the audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element, and wherein the audio unit is configured to render, based on the modified audio metadata, the at least one audio element to the one or more speaker feeds.
  • Example 11A The device of example 10A, wherein the scene manager is configured to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata, and wherein the scene manager is further configured to register, with the audio unit, a callback by which the audio unit is configured to request the modified audio metadata prior to rendering the at least one audio element.
  • Example 12A The device of example 11A, wherein the scene manager registers the callback via an application programming interface exposed by the audio unit.
  • Example 13A The device of any combination of examples 11A and 12A, wherein the audio unit is configured to request, responsive to the callback and prior to rendering the at least one audio element, the modified audio metadata.
  • Example 14A The device of any combination of examples 11A-13A, wherein the audio unit is configured to request, responsive to the callback and prior to rendering each frame of audio data for the at least one audio element, the modified audio metadata.
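The callback mechanism of examples 11A-14A can be sketched as below: the scene manager registers a callback through an interface exposed by the audio unit, and the audio unit pulls the modified metadata before rendering each frame. Class and method names are assumptions, not the actual API:

```python
# Hypothetical callback registration between scene manager and audio unit.
class AudioUnit:
    def __init__(self):
        self._metadata_cb = None

    def register_metadata_callback(self, cb):
        # stand-in for the API exposed by the audio unit (example 12A)
        self._metadata_cb = cb

    def render_frame(self, audio_frame):
        # request the modified metadata prior to rendering each frame (14A)
        meta = self._metadata_cb() if self._metadata_cb else {}
        return {"frame": audio_frame, "gain": meta.get("gain", 1.0)}


class SceneManager:
    def __init__(self):
        self.modified_metadata = {"gain": 0.5}

    def metadata_callback(self):
        # returns the latest modified audio metadata on demand
        return self.modified_metadata


manager = SceneManager()
unit = AudioUnit()
unit.register_metadata_callback(manager.metadata_callback)
rendered = unit.render_frame("frame_0")
```

The pull model keeps the renderer in control of timing: metadata is sampled once per frame rather than pushed asynchronously mid-render.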
  • Example 15A The device of example 10A, wherein the processing circuitry is further configured to execute an audio processing unit, wherein the scene manager is configured to configure the audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata, and wherein the audio processing unit is configured to: replace, based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata; and output the audio bitstream to the audio unit.
  • Example 16A The device of example 15A, wherein the scene manager is configured, via an application programming interface exposed by the audio processing unit, to configure the audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata.
  • Example 17A The device of example 10A, wherein the scene manager is further configured to: construct, based on the mapping of the at least one visual element to the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; and modify, based on the scene graph, the audio metadata to obtain modified audio metadata.
  • Example 18A The device of example 17A, wherein the scene manager is further configured to output the modified audio metadata to the audio unit.
  • Example 19A The device of example 17A, wherein the scene manager is further configured to output, via an application programming interface exposed by the audio unit, the modified audio metadata to the audio unit.
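Examples 17A-19A place the audio element as a child node that depends from the visual element's parent node, so the audio metadata can be modified by resolving the child's pose against its ancestors. A sketch under the assumption that each node stores a local position offset:

```python
# Sketch of parent/child scene-graph resolution; the node layout is illustrative.
def resolve_position(node, graph):
    """Walk up the parent chain, summing local offsets into a world position,
    so the audio child inherits movement of its visual parent."""
    x, y, z = node["local_position"]
    parent = node.get("parent")
    while parent is not None:
        px, py, pz = graph[parent]["local_position"]
        x, y, z = x + px, y + py, z + pz
        parent = graph[parent].get("parent")
    return (x, y, z)

graph = {
    # visual parent node (e.g., an avatar) and its dependent audio child node
    "avatar": {"local_position": (3.0, 0.0, 1.0), "parent": None},
    "voice": {"local_position": (0.0, 1.6, 0.0), "parent": "avatar"},
}
modified_position = resolve_position(graph["voice"], graph)
```

When the avatar node moves, only its local position changes; re-resolving the child yields the updated audio position without touching the audio node itself.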
  • Example 20A A method of processing at least one audio element, the method comprising: modifying, based on a mapping of at least one visual element to the at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; rendering, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting the one or more speaker feeds.
  • Example 21.5A The method of example 20A, further comprising mapping, based on visual metadata associated with the at least one visual element and audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 21A The method of example 20A, wherein the visual metadata includes a pose of the at least one visual element in the extended reality scene, wherein the audio metadata includes a pose of the at least one audio element in the extended reality scene, and wherein modifying the audio metadata comprises modifying, based on the pose of the at least one visual element, the pose of the at least one audio element to obtain a modified pose of the at least one audio element in the extended reality scene.
  • Example 22A The method of example 21A, wherein the modified pose of the at least one audio element differs from the pose of the at least one audio element.
  • Example 23A The method of any combination of examples 21A and 22A, wherein the modified pose of the at least one audio element differs from the pose of the at least one audio element in terms of a rotational angle.
  • Example 24A The method of any combination of examples 21A-23A, wherein the modified pose of the at least one audio element differs from the pose of the at least one audio element in terms of a translational distance.
  • Example 25A The method of any combination of examples 20A-24A, wherein the at least one audio element includes a first audio element and a second audio element, wherein the method further comprises: mapping, based on visual metadata associated with the at least one visual element and audio metadata associated with the first audio element, the at least one visual element to the first audio element; determining that none of the at least one visual element maps to the second audio element; and rendering, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 26A The method of any combination of examples 20A-25A, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the method further comprises mapping, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 27A The method of example 26A, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 28A The method of any combination of examples 20A-27A, further comprising reproducing, by one or more speakers and based on the one or more speaker feeds, a soundfield.
  • Example 29A The method of any combination of examples 20A-28A, further comprising executing a scene manager and an audio unit, wherein the scene manager is configured to map, based on the visual metadata associated with the at least one visual element and the audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element, and wherein the audio unit is configured to render, based on the modified audio metadata, the at least one audio element to the one or more speaker feeds.
  • Example 30A The method of example 29A, wherein modifying the audio metadata comprises executing the scene manager to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata, and wherein the method further comprises registering, by the scene manager and with the audio unit, a callback by which the audio unit is configured to request the modified audio metadata prior to rendering the at least one audio element.
  • Example 31A The method of example 30A, further comprising registering the callback via an application programming interface exposed by the audio unit.
  • Example 32A The method of any combination of examples 30A and 31A, further comprising requesting, responsive to the callback and prior to rendering the at least one audio element, the modified audio metadata.
  • Example 33A The method of any combination of examples 30A-32A, further comprising requesting, responsive to the callback and prior to rendering each frame of audio data for the at least one audio element, the modified audio metadata.
  • Example 34A The method of example 29A, further comprising configuring an audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata, wherein the method further comprises: replacing, based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata; and outputting the audio bitstream to the audio unit.
  • Example 35A The method of example 34A, wherein configuring the audio processing unit comprises configuring the audio processing unit, via an application programming interface exposed by the audio processing unit, to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata.
  • Example 36A The method of example 29A, further comprising: constructing, based on the mapping of the at least one visual element to the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; and modifying, based on the scene graph, the audio metadata to obtain modified audio metadata.
  • Example 37A The method of example 36A, further comprising outputting the modified audio metadata to the audio unit.
  • Example 38A The method of example 36A, further comprising outputting, via an application programming interface exposed by the audio unit, the modified audio metadata to the audio unit.
  • Example 39A A non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: modify, based on a mapping of at least one visual element to at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • Example 1B A device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene; and processing circuitry coupled to the memory and configured to execute a scene manager and an audio unit, wherein the scene manager is configured to: modify, based on a mapping of the at least one visual element to the at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; and register, with the audio unit, a callback by which the audio unit is configured to request the modified audio metadata prior to rendering the at least one audio element, and wherein the audio unit is configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • Example 1.5B The device of example 1B, wherein the scene manager is configured to map, based on visual metadata associated with the at least one visual element and the audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 2B The device of example 1B, wherein the visual metadata includes a position of the at least one visual element in the extended reality scene, wherein the audio metadata includes a position of the at least one audio element in the extended reality scene, and wherein the scene manager is configured to modify, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 3B The device of example 2B, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 4B The device of any combination of examples 2B and 3B, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 5B The device of any combination of examples 2B-4B, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 6B The device of any combination of examples 1B-5B, wherein the at least one audio element includes a first audio element and a second audio element, wherein the scene manager is configured to map, based on visual metadata associated with the at least one visual element and audio metadata associated with the first audio element, the at least one visual element to the first audio element, wherein the scene manager is further configured to: determine that none of the at least one visual element maps to the second audio element; and render, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 7B The device of any combination of examples 1B-6B, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the scene manager is configured to map, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 8B The device of example 7B, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 9B The device of any combination of examples 1B-8B, further comprising one or more speakers configured to reproduce, based on the one or more speaker feeds, a soundfield.
  • Example 10B The device of any combination of examples 1B-9B, wherein the scene manager registers the callback via an application programming interface exposed by the audio unit.
  • Example 11B The device of any combination of examples 1B-10B, wherein the audio unit is configured to request, responsive to the callback and prior to rendering the at least one audio element, the modified audio metadata.
  • Example 12B The device of any combination of examples 1B-11B, wherein the audio unit is configured to request, responsive to the callback and prior to rendering each frame of audio data for the at least one audio element, the modified audio metadata.
  • Example 13B A method of processing at least one audio element, the method comprising: modifying, by a scene manager executed by processing circuitry and based on the at least one visual element and the at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; rendering, by an audio unit executed by the processing circuitry and based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting, by the audio unit, the one or more speaker feeds.
  • Example 13.5B The method of example 13B, further comprising mapping, by the scene manager and based on visual metadata associated with at least one visual element and the audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 14B The method of example 13B, wherein the visual metadata includes a position of the at least one visual element in the extended reality scene, wherein the audio metadata includes a position of the at least one audio element in the extended reality scene, and wherein modifying the audio metadata comprises modifying, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 15B The method of example 14B, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 16B The method of any combination of examples 14B and 15B, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 17B The method of any combination of examples 14B-16B, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 18B The method of any combination of examples 13B-17B, wherein the at least one audio element includes a first audio element and a second audio element, wherein the method further comprises: mapping, based on visual metadata associated with the at least one visual element and audio metadata associated with the first audio element, the at least one visual element to the first audio element; determining that none of the at least one visual element maps to the second audio element; and rendering, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 19B The method of any combination of examples 13B-18B, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the method further comprises mapping, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 20B The method of example 19B, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 21B The method of any combination of examples 13B-20B, further comprising reproducing, by one or more speakers and based on the one or more speaker feeds, a soundfield.
  • Example 22B The method of any combination of examples 13B-21B, further comprising registering the callback via an application programming interface exposed by the audio unit.
  • Example 23B The method of any combination of examples 13B-22B, further comprising requesting, responsive to the callback and prior to rendering the at least one audio element, the modified audio metadata.
  • Example 24B The method of any combination of examples 13B-23B, further comprising requesting, responsive to the callback and prior to rendering each frame of audio data for the at least one audio element, the modified audio metadata.
  • Example 25B A non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: execute a scene manager configured to modify, based on the at least one visual element and the at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; and execute an audio unit configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
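The B-series examples above describe a callback-driven flow: the scene manager registers a callback via an API exposed by the audio unit (example 22B), and the audio unit requests the modified audio metadata prior to rendering each frame (example 24B), snapping a mapped audio element's position to its visual element while unmapped elements keep their original metadata (example 18B). A minimal sketch of that flow follows; all class and method names (`SceneManager`, `AudioUnit`, `register_callback`, the dictionary layout) are illustrative assumptions, not API names from the specification.

```python
class SceneManager:
    """Maps visual elements to audio elements and modifies audio metadata."""

    def __init__(self, visual_elements, audio_metadata):
        # Both dictionaries are keyed by a shared element identifier
        # (cf. examples 19B-20B, which map by identifier or name).
        self.visual_elements = visual_elements
        self.audio_metadata = audio_metadata

    def get_modified_metadata(self):
        modified = {}
        for elem_id, meta in self.audio_metadata.items():
            visual = self.visual_elements.get(elem_id)
            if visual is not None:
                # Mapped audio element: align its position with the visual element.
                modified[elem_id] = {**meta, "position": visual["position"]}
            else:
                # No visual element maps to this audio element: keep the
                # original metadata (cf. example 18B).
                modified[elem_id] = meta
        return modified


class AudioUnit:
    """Renders audio elements to speaker feeds, pulling metadata via a callback."""

    def __init__(self):
        self._metadata_callback = None

    def register_callback(self, callback):
        # Stand-in for the application programming interface of example 22B.
        self._metadata_callback = callback

    def render_frame(self, audio_elements):
        # Request the modified metadata before rendering each frame (example 24B).
        metadata = self._metadata_callback()
        feeds = []
        for elem_id, samples in audio_elements.items():
            feeds.append({
                "id": elem_id,
                "position": metadata[elem_id]["position"],
                "samples": samples,
            })
        return feeds
```

In use, a "door" audio element mapped to a "door" visual element is rendered at the visual element's position, while an unmapped "wind" element is rendered from its unmodified metadata.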
  • Example 1C A device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene; and processing circuitry coupled to the memory and configured to execute a scene manager, an audio processing unit, and an audio unit, wherein the scene manager is configured to: modify, based on the at least one visual element and the at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; and configure the audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata, wherein the audio processing unit is configured to: replace, based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata; and output the audio bitstream to the audio unit, and wherein the audio unit is configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • Example 2C The device of example 1C, wherein the visual metadata includes a position of the at least one visual element in the extended reality scene, wherein the audio metadata includes a position of the at least one audio element in the extended reality scene, and wherein the scene manager is configured to modify, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 3C The device of example 2C, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 4C The device of any combination of examples 2C and 3C, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 5C The device of any combination of examples 2C-4C, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 6C The device of any combination of examples 1C-5C, wherein the at least one audio element includes a first audio element and a second audio element, wherein the scene manager is further configured to: map, based on visual metadata associated with the at least one visual element and the audio metadata associated with the first audio element, the at least one visual element to the first audio element; determine that none of the at least one visual element maps to the second audio element; and render, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 7C The device of any combination of examples 1C-6C, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the scene manager is configured to map, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 8C The device of example 7C, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 9C The device of any combination of examples 1C-8C, further comprising one or more speakers configured to reproduce, based on the one or more speaker feeds, a soundfield.
  • Example 10C The device of any combination of examples 1C-9C, wherein the scene manager is configured, via an application programming interface exposed by the audio processing unit, to configure the audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata.
  • Example 11C A method of processing at least one audio element, the method comprising: modifying, by a scene manager executed by processing circuitry and based on the at least one visual element and the at least one audio element, audio metadata associated with the at least one audio element to obtain modified audio metadata; configuring, by the scene manager, an audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata; replacing, by the audio processing unit and based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata; outputting, by the audio processing unit, the audio bitstream to an audio unit executed by the processing circuitry; rendering, by the audio unit and based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting, by the audio unit, the one or more speaker feeds.
  • Example 11.5C The method of example 11C, further comprising mapping, by the scene manager and based on visual metadata associated with at least one visual element and the audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 12C The method of example 11C, wherein the visual metadata includes a position of the at least one visual element in the extended reality scene, wherein the audio metadata includes a position of the at least one audio element in the extended reality scene, and wherein modifying the audio metadata comprises modifying, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 13C The method of example 12C, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 14C The method of any combination of examples 12C and 13C, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 15C The method of any combination of examples 12C-14C, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 16C The method of any combination of examples 11C-15C, wherein the at least one audio element includes a first audio element and a second audio element, wherein the method further comprises: mapping, based on visual metadata associated with the at least one visual element and the audio metadata associated with the first audio element, the at least one visual element to the first audio element; determining that none of the at least one visual element maps to the second audio element; and rendering, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 17C The method of any combination of examples 11C-16C, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the method further comprises mapping, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 18C The method of example 17C, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 19C The method of any combination of examples 11C-18C, further comprising reproducing, by one or more speakers and based on the one or more speaker feeds, a soundfield.
  • Example 20C The method of any combination of examples 11C-19C, wherein configuring the audio processing unit comprises configuring, via an application programming interface exposed by the audio processing unit, the audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata.
  • Example 21C A non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: execute a scene manager configured to modify, based on the at least one visual element and the at least one audio element, the audio metadata to obtain modified audio metadata, and configure an audio processing unit to modify, based on the mapping of the at least one visual element to the at least one audio element, the audio metadata to obtain the modified audio metadata; execute the audio processing unit to replace, based on the configuration, the audio metadata in the audio bitstream with the modified audio metadata, and output the audio bitstream to the audio unit; and execute an audio unit configured to render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds, and output the one or more speaker feeds.
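The C-series examples split the work between a scene manager and a separate audio processing unit: the scene manager configures the processing unit, via an API the processing unit exposes (examples 10C and 20C), to replace the audio metadata carried in the audio bitstream with the modified metadata, so the downstream audio unit renders from the rewritten bitstream. The sketch below models that split; the class names, the dictionary-based bitstream representation, and the `configure`/`process` methods are all assumptions for illustration, not structures defined by the specification.

```python
class AudioProcessingUnit:
    """Rewrites audio metadata in the bitstream per the scene manager's configuration."""

    def __init__(self):
        self._replacements = {}

    def configure(self, replacements):
        # Stand-in for the API exposed to the scene manager (examples 10C, 20C):
        # `replacements` maps audio element identifiers to modified metadata.
        self._replacements = dict(replacements)

    def process(self, bitstream):
        # Replace the configured elements' metadata; untouched elements keep
        # their original metadata. The audio payload passes through unchanged.
        out = {
            "payload": bitstream["payload"],
            "metadata": dict(bitstream["metadata"]),
        }
        out["metadata"].update(self._replacements)
        return out


def render(bitstream):
    """Stand-in audio unit: produces one speaker-feed descriptor per audio element."""
    return [
        {"id": elem_id, "position": meta["position"]}
        for elem_id, meta in bitstream["metadata"].items()
    ]
```

The audio unit never sees the original metadata for a reconfigured element; it renders directly from the rewritten bitstream, which is the distinguishing feature of this variant relative to the callback flow of the B-series examples.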
  • Example 1D A device configured to process an audio bitstream, the device comprising: a memory configured to store a visual bitstream representative of at least one visual element in an extended reality scene and the audio bitstream representative of at least one audio element in the extended reality scene, wherein the audio bitstream includes audio metadata; and processing circuitry coupled to the memory and configured to execute a scene manager and an audio unit, wherein the scene manager is configured to: construct, based on the at least one visual element and the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; and modify, based on the scene graph, the audio metadata to obtain modified audio metadata, and wherein the audio unit is configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • Example 1.5D The device of example 1D, wherein the scene manager is further configured to map, based on visual metadata associated with the at least one visual element and the audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 2D The device of example 1D, wherein the visual metadata includes a position of the at least one visual element in the extended reality scene, wherein the audio metadata includes a position of the at least one audio element in the extended reality scene, and wherein the scene manager is configured to modify, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 3D The device of example 2D, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 4D The device of any combination of examples 2D and 3D, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 5D The device of any combination of examples 2D-4D, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 6D The device of any combination of examples 1D-5D, wherein the at least one audio element includes a first audio element and a second audio element, and wherein the scene manager is further configured to: map, based on visual metadata associated with the at least one visual element and audio metadata associated with the first audio element, the at least one visual element to the first audio element; determine that none of the at least one visual element maps to the second audio element; and render, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 7D The device of any combination of examples 1D-6D, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the scene manager is configured to map, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 8D The device of example 7D, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 9D The device of any combination of examples 1D-8D, further comprising one or more speakers configured to reproduce, based on the one or more speaker feeds, a soundfield.
  • Example 10D The device of any combination of examples 1D-9D, wherein the scene manager is further configured to output the modified audio metadata to the audio unit.
  • Example 11D The device of any combination of examples 1D-9D, wherein the scene manager is further configured to output, via an application programming interface exposed by the audio unit, the modified audio metadata to the audio unit.
  • Example 12D A method of processing at least one audio element, the method comprising: constructing, by a scene manager executed by processing circuitry and based on the at least one visual element and the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; modifying, by the scene manager and based on the scene graph, audio metadata associated with the at least one audio element to obtain modified audio metadata; rendering, by an audio unit executed by the processing circuitry and based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and outputting, by the audio unit, the one or more speaker feeds.
  • Example 12.5D The method of example 12D, further comprising mapping, by the scene manager and based on visual metadata associated with at least one visual element and the audio metadata associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 13D The method of example 12D, wherein the visual metadata includes a position of the at least one visual element in the extended reality scene, wherein the audio metadata includes a position of the at least one audio element in the extended reality scene, and wherein modifying the audio metadata comprises modifying, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 14D The method of example 13D, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 15D The method of any combination of examples 13D and 14D, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 16D The method of any combination of examples 13D-15D, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 17D The method of any combination of examples 12D-16D, wherein the at least one audio element includes a first audio element and a second audio element, and wherein the method further comprises: mapping, based on visual metadata associated with the at least one visual element and audio metadata associated with the first audio element, the at least one visual element to the first audio element; determining that none of the at least one visual element maps to the second audio element; and rendering, based on the audio metadata associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 18D The method of any combination of examples 12D-17D, wherein the visual metadata includes an identifier that uniquely identifies the at least one visual element, wherein the audio metadata includes an identifier that uniquely identifies the at least one audio element, and wherein the method further comprises mapping, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 19D The method of example 18D, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 20D The method of any combination of examples 12D-19D, further comprising reproducing, by one or more speakers and based on the one or more speaker feeds, a soundfield.
  • Example 21D The method of any combination of examples 12D-20D, further comprising outputting the modified audio metadata to the audio unit.
  • Example 22D The method of any combination of examples 12D-20D, further comprising outputting, via an application programming interface exposed by the audio unit, the modified audio metadata to the audio unit.
  • Example 23D A non-transitory computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors to: execute a scene manager configured to: construct, by the scene manager and based on the at least one visual element and the at least one audio element, a scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element; and modify, by the scene manager and based on the scene graph, the audio metadata to obtain modified audio metadata; and execute an audio unit configured to: render, based on the modified audio metadata, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
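The D-series examples couple visual and audio elements through a scene graph in which the visual element is a parent node and the audio element a child node that depends from it; modifying the audio metadata then reduces to a graph traversal that derives each audio element's position from its parent's. The sketch below illustrates one way such a graph could work; the `Node` class, the additive local-offset transform, and the helper function are assumptions made for illustration, not structures defined by the specification.

```python
class Node:
    """Scene-graph node holding a local position and child nodes."""

    def __init__(self, name, position=(0.0, 0.0, 0.0)):
        self.name = name
        self.position = position  # local offset relative to the parent node
        self.children = []

    def add_child(self, node):
        # A child node "depends from" this parent node (cf. example 1D).
        self.children.append(node)
        return node

    def world_positions(self, origin=(0.0, 0.0, 0.0)):
        # Accumulate parent offsets down the tree, so moving a parent
        # visual node implicitly moves its child audio nodes.
        world = tuple(o + p for o, p in zip(origin, self.position))
        positions = {self.name: world}
        for child in self.children:
            positions.update(child.world_positions(world))
        return positions


def modified_audio_metadata(root, audio_metadata):
    """Overwrite each audio element's position with its scene-graph world position.

    Audio elements without a node in the graph keep their original position.
    """
    world = root.world_positions()
    return {
        name: {**meta, "position": world.get(name, meta["position"])}
        for name, meta in audio_metadata.items()
    }
```

Here a "door creak" audio node attached under a "door" visual node is repositioned whenever the door's visual node moves, while an audio element absent from the graph (e.g. ambient wind) renders from its unmodified metadata, mirroring examples 6D and 17D.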
  • Example 1E A device configured to process a bitstream, the device comprising: a memory configured to store the bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element; and processing circuitry coupled to the memory and configured to execute a scene manager and an audio unit, wherein the scene manager is configured to: construct, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element; and modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information, and wherein the audio unit is configured to: render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
  • Example 2E The device of example 1E, wherein the scene manager is further configured to obtain at least one visual element, and wherein the scene manager is configured to construct, based on the at least one audio element and the at least one visual element, the scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element.
  • Example 3E The device of example 2E, wherein the scene manager is configured to align the at least one visual element and the at least one audio element when constructing the scene graph.
  • Example 4E The device of any combination of examples 1E-3E, wherein the scene manager is further configured to update the scene graph to add, remove, or edit the at least one node that represents the at least one audio element.
  • Example 5E The device of example 2E, wherein the scene manager is further configured to map, based on visual descriptive information associated with the at least one visual element and the audio descriptive information associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 6E The device of example 5E, wherein the visual descriptive information includes a position of the at least one visual element in the extended reality scene, wherein the audio descriptive information includes a position of the at least one audio element in the extended reality scene, and wherein the scene manager is configured to modify, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 7E The device of example 6E, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 8E The device of any combination of examples 6E and 7E, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 9E The device of any combination of examples 6E-8E, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 10E The device of any combination of examples 5E-9E, wherein the at least one audio element includes a first audio element and a second audio element, wherein the scene manager is further configured to: map, based on visual descriptive information associated with the at least one visual element and audio descriptive information associated with the first audio element, the at least one visual element to the first audio element; determine that none of the at least one visual element maps to the second audio element; and render, based on the audio descriptive information associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 11E The device of any combination of examples 5E-10E, wherein the visual descriptive information includes an identifier that uniquely identifies the at least one visual element, wherein the audio descriptive information includes an identifier that uniquely identifies the at least one audio element, and wherein the scene manager is configured to map, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 12E The device of example 11E, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 13E The device of any combination of examples 5E-12E, further comprising one or more speakers configured to reproduce, based on the one or more speaker feeds, a soundfield.
  • Example 14E The device of any combination of examples 5E-13E, wherein the scene manager is further configured to output the modified audio descriptive information to the audio unit.
  • Example 15E The device of any combination of examples 5E-13E, wherein the scene manager is further configured to output, via an application programming interface exposed by the audio unit, the modified audio descriptive information to the audio unit.
  • Example 15.5E The device of any combination of examples 1E-15E, wherein the bitstream is transmitted according to one or more of a wireless network protocol, a personal area network protocol, and a cellular network protocol.
  • Example 16E A method comprising: obtaining a bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element; constructing, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element; modifying, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information; rendering, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and outputting the one or more speaker feeds.
  • Example 17E The method of example 16E, further comprising obtaining at least one visual element, wherein constructing the scene graph includes constructing, based on the at least one audio element and the at least one visual element, the scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element.
  • Example 18E The method of example 17E, wherein constructing the scene graph includes aligning the at least one visual element and the at least one audio element.
  • Example 19E The method of any combination of examples 16E-18E, further comprising updating the scene graph to add, remove, or edit the at least one node that represents the at least one audio element.
  • Example 20E The method of example 17E, further comprising mapping, based on visual descriptive information associated with the at least one visual element and the audio descriptive information associated with the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 21E The method of example 20E, wherein the visual descriptive information includes a position of the at least one visual element in the extended reality scene, wherein the audio descriptive information includes a position of the at least one audio element in the extended reality scene, and wherein modifying the audio descriptive information comprises modifying, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene.
  • Example 22E The method of example 21E, wherein the modified position of the at least one audio element differs from the position of the at least one audio element.
  • Example 23E The method of any combination of examples 21E and 22E, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle.
  • Example 24E The method of any combination of examples 21E-23E, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance.
  • Example 25E The method of any combination of examples 20E-24E, wherein the at least one audio element includes a first audio element and a second audio element, and wherein the method further comprises: mapping, based on visual descriptive information associated with the at least one visual element and audio descriptive information associated with the first audio element, the at least one visual element to the first audio element; determining that none of the at least one visual element maps to the second audio element; and rendering, based on the audio descriptive information associated with the second audio element, the second audio element to the one or more speaker feeds.
  • Example 26E The method of any combination of examples 20E-25E, wherein the visual descriptive information includes an identifier that uniquely identifies the at least one visual element, wherein the audio descriptive information includes an identifier that uniquely identifies the at least one audio element, and wherein mapping the at least one visual element to the at least one audio element comprises mapping, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element.
  • Example 27E The method of example 26E, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name.
  • Example 28E The method of any combination of examples 20E-27E, further comprising outputting the modified audio metadata to an audio unit.
  • Example 29E The method of any combination of examples 20E-27E, further comprising outputting, via an application programming interface exposed by the audio unit, the modified audio metadata to the audio unit.
  • Example 29.5E The method of any combination of examples 20E-27E, wherein the bitstream is transmitted according to one or more of a wireless network protocol, a personal area network protocol, and a cellular network protocol.
  • Example 30E A non-transitory computer-readable medium having stored thereon instructions that, when executed, cause processing circuitry to: obtain a bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element; construct, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element; modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information; render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds.
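As a concrete illustration of the scene-graph concepts in examples 16E-27E, the sketch below builds visual parent nodes with dependent audio child nodes, maps audio elements to visual elements by unique identifier (example 26E), updates a mapped audio element's position from its visual parent (example 21E), and leaves unmapped audio elements to be rendered from their own descriptive information (example 25E). All class, field, and element names here are illustrative assumptions, not the patented implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    element_id: str                 # identifier that uniquely identifies the element
    position: tuple                 # (x, y, z) position in the extended reality scene
    children: list = field(default_factory=list)

def map_audio_to_visual(visual_nodes, audio_nodes):
    """Map audio elements to visual elements by unique identifier.

    A matched audio node becomes a child of its visual parent and takes the
    parent's position as its modified position; audio nodes with no matching
    visual element are returned separately so they can be rendered from their
    original audio descriptive information.
    """
    by_id = {v.element_id: v for v in visual_nodes}
    mapped, unmapped = [], []
    for a in audio_nodes:
        v = by_id.get(a.element_id)
        if v is not None:
            v.children.append(a)     # audio node depends from the visual parent node
            a.position = v.position  # modified position follows the visual element
            mapped.append(a)
        else:
            unmapped.append(a)       # rendered from its own descriptive information
    return mapped, unmapped

visuals = [Node("avatar", (1.0, 0.0, 2.0))]
audios = [Node("avatar", (0.0, 0.0, 0.0)), Node("ambience", (5.0, 0.0, 5.0))]
mapped, unmapped = map_audio_to_visual(visuals, audios)
```

In this sketch the "avatar" audio element is re-positioned to its visual parent, while the "ambience" element, which maps to no visual element, keeps its original position.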

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Stereophonic System (AREA)

Abstract

A device configured to process a bitstream may implement the techniques. The device includes a memory configured to store the bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element. The device also includes processing circuitry coupled to the memory and configured to execute a scene manager and an audio unit. The scene manager is configured to construct, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element, and to modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information. The audio unit is configured to render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds, and to output the one or more speaker feeds.
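The abstract's two-stage pipeline can be sketched as a scene manager that modifies the audio descriptive information, followed by an audio unit that renders the audio element to speaker feeds. This is a minimal, hypothetical sketch: the function names, the dictionary-based descriptive information, and the constant-power stereo panning law are all assumptions for illustration, not the claimed implementation.

```python
import math

def scene_manager(audio_info, scene_graph_offset):
    """Modify the audio element's position based on the scene graph
    (here, a simple translational offset from a mapped visual element)."""
    x, y, z = audio_info["position"]
    dx, dy, dz = scene_graph_offset
    return {**audio_info, "position": (x + dx, y + dy, z + dz)}

def audio_unit(samples, audio_info):
    """Render mono samples to left/right speaker feeds using constant-power
    panning derived from the modified x position, clamped to [-1, 1]."""
    pan = max(-1.0, min(1.0, audio_info["position"][0]))
    theta = (pan + 1.0) * math.pi / 4.0      # map pan to 0..pi/2
    left_gain, right_gain = math.cos(theta), math.sin(theta)
    return ([s * left_gain for s in samples],
            [s * right_gain for s in samples])

info = {"position": (0.0, 0.0, 1.0)}
modified = scene_manager(info, (1.0, 0.0, 0.0))  # audio follows the visual offset
feeds = audio_unit([0.5, -0.5], modified)        # (left feed, right feed)
```

With the element moved fully to the right (x = 1.0), nearly all signal energy lands in the right speaker feed.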
PCT/US2023/074583 2022-09-26 2023-09-19 Rendering interface for audio data in extended reality systems WO2024073275A1 (fr)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202263377169P 2022-09-26 2022-09-26
US63/377,169 2022-09-26
US202363578618P 2023-08-24 2023-08-24
US63/578,618 2023-08-24
US18/467,869 US20240114312A1 (en) 2022-09-26 2023-09-15 Rendering interface for audio data in extended reality systems
US18/467,869 2023-09-15

Publications (1)

Publication Number Publication Date
WO2024073275A1 true WO2024073275A1 (fr) 2024-04-04

Family

ID=88412398

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/074583 WO2024073275A1 (fr) 2022-09-26 2023-09-19 Rendering interface for audio data in extended reality systems

Country Status (1)

Country Link
WO (1) WO2024073275A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190007781A1 (en) 2017-06-30 2019-01-03 Qualcomm Incorporated Mixed-order ambisonics (moa) audio data for computer-mediated reality systems
US20210099773A1 (en) * 2019-10-01 2021-04-01 Qualcomm Incorporated Using gltf2 extensions to support video and audio data
US20220222205A1 (en) * 2021-01-14 2022-07-14 Tencent America LLC Method and apparatus for media scene description

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
3GPP TECHNICAL SPECIFICATION (TS) 26.118
3GPP TR 26.998
JASON PETERSON, VIRTUAL REALITY, AUGMENTED REALITY, AND MIXED REALITY DEFINITIONS, 7 July 2017 (2017-07-07)
POLETTI, M: "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics", J. AUDIO ENG. SOC, vol. 53, no. 11, November 2005 (2005-11-01), pages 1004 - 1025
THE KHRONOS 3D FORMATS WORKING GROUP: "glTF(TM) 2.0 Specification", 11 October 2021 (2021-10-11), XP093115398, Retrieved from the Internet <URL:https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html> [retrieved on 20240102] *

Similar Documents

Publication Publication Date Title
US10924876B2 (en) Interpolating audio streams
US11356793B2 (en) Controlling rendering of audio data
US11429340B2 (en) Audio capture and rendering for extended reality experiences
EP4062404B1 (fr) Codage de champ sonore basé sur la priorité pour un contenu audio de réalité virtuelle
US11317236B2 (en) Soundfield adaptation for virtual reality audio
US11089428B2 (en) Selecting audio streams based on motion
US11140503B2 (en) Timer-based access for audio streaming and rendering
US20210006976A1 (en) Privacy restrictions for audio rendering
US11580213B2 (en) Password-based authorization for audio rendering
WO2024081530A1 Scaling of audio sources in extended reality systems
US20240114312A1 (en) Rendering interface for audio data in extended reality systems
WO2024073275A1 Rendering interface for audio data in extended reality systems
US11601776B2 (en) Smart hybrid rendering for augmented reality/virtual reality audio
US11750998B2 (en) Controlling rendering of audio data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23789848

Country of ref document: EP

Kind code of ref document: A1