WO2024097485A1 - Low bitrate scene-based audio coding - Google Patents

Low bitrate scene-based audio coding

Info

Publication number
WO2024097485A1
WO2024097485A1 (PCT/US2023/075621)
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
bands
group
spar
dirac
Prior art date
Application number
PCT/US2023/075621
Other languages
English (en)
Inventor
Stefanie Brown
Rishabh Tyagi
Original Assignee
Dolby Laboratories Licensing Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corporation
Publication of WO2024097485A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Definitions

  • This disclosure relates generally to audio processing.
  • Spatial Reconstruction (SPAR) and Directional Audio Coding (DirAC) are separate spatial audio coding technologies that each seek to represent an input spatial audio scene in a compact way to enable transmission with a good trade-off between audio quality and bitrate.
  • One such input format for a spatial audio scene is a scene-based audio representation (e.g., first-order Ambisonics (FOA) or higher-order Ambisonics (HOA)).
  • FOA first-order Ambisonics
  • HOA higher-order Ambisonics
  • SPAR seeks to maximize perceived audio quality while minimizing bitrate by reducing the energy of the transmitted audio data while still allowing the second-order statistics of the Ambisonics audio scene (i.e., the covariance) to be reconstructed at the decoder side using transmitted metadata. SPAR seeks to faithfully reconstruct the input Ambisonics scene at the output of the decoder.
  • DirAC is a technology which represents spatial audio scenes as a collection of directions of arrival (DOA) in time-frequency tiles. From this representation, a similar-sounding scene can be reproduced in a different output format (e.g., binaural). Notably, in the context of Ambisonics, the DirAC representation allows a decoder to produce higher-order output from low-order input. DirAC seeks to preserve direction and diffuseness of the dominant sounds in the input scene.
  • DOA directions of arrival
  • LBRSBA low bitrate scene-based audio
  • a method of audio metadata encoding comprises: receiving, with at least one processor, scene-based audio metadata; creating, with the at least one processor and from the scene-based audio metadata, Spatial Reconstruction (SPAR) metadata and Directional Audio Coding (DirAC) metadata; forming, with the at least one processor, a group of SPAR metadata bands and a group of DirAC metadata bands; quantizing, with the at least one processor, the group of SPAR metadata bands and the group of DirAC metadata bands; and sending to a decoder: a first data frame including the quantized group of DirAC metadata bands and a first portion of the quantized group of SPAR metadata bands, and a second data frame following the first data frame, the second data frame including the quantized DirAC metadata bands and a second portion of the quantized group of SPAR metadata bands.
  • SPAR Spatial Reconstruction
  • DirAC Directional Audio Coding
  • the method further comprises sending to the decoder a signal indicating the first data frame or the second data frame.
  • the group of SPAR metadata bands includes four SPAR metadata bands and the group of DirAC bands includes two DirAC metadata bands, where the group of SPAR bands is lower in frequency than the group of DirAC bands.
  • the group of DirAC metadata bands is sent to the decoder at a first time resolution and the first and second portions of the group of SPAR metadata bands are sent to the decoder at a second time resolution, wherein the second time resolution is lower than the first time resolution.
  • the group of DirAC metadata bands is sent to the decoder at the first time resolution when the first data frame is an initial data frame or the group of DirAC metadata bands is encoded within a metadata bitrate budget.
  • the group of SPAR metadata bands is sent to the decoder at the second time resolution when the group of SPAR metadata bands is not encoded within the metadata bitrate budget.
  • the method further comprises, prior to receiving the scene-based audio metadata, applying, with the at least one processor, smoothing to a covariance matrix from which the scene-based audio metadata is formed.
  • the covariance smoothing uses a smoothing factor that increases smoothing at low frequency bands and avoids modifying an amount of smoothing in high frequency bands.
  • a method of audio metadata decoding comprises: receiving, with at least one processor, quantized scene-based audio data and corresponding metadata, the metadata including decorrelator coefficients; dequantizing, with the at least one processor, the quantized scene-based audio data and corresponding metadata; decoding, with the at least one processor, the scene-based audio data and corresponding metadata, the decoding including recovering the decorrelator coefficients; smoothing, with the at least one processor, the decorrelator coefficients; and reconstructing, with the at least one processor, a multichannel audio signal based on at least the decoded scene-based audio data and the smoothed decorrelator coefficients.
  • FIG. 1 is a block diagram of an IVAS codec framework, according to one or more embodiments.
  • FIG. 2 is a flow diagram of a covariance smoothing process, according to one or more embodiments.
  • FIG. 3 is a flow diagram of an example modification to the process shown in FIG. 2 for a maximum permitted forgetting factor, according to one or more embodiments.
  • FIG. 4 is a flow diagram of an example modification to transient detection process flow, according to one or more embodiments.
  • FIG. 5 is a plot of decorrelation coefficients over n frames, according to one or more embodiments.
  • FIG. 6 is a flow diagram of LBRSBA (e.g., Ambisonics) processing, according to one or more embodiments.
  • LBRSBA e.g., Ambisonics
  • FIG. 7 is a block diagram of an example hardware architecture suitable for implementing the systems and methods described in reference to FIGS. 1-6.
  • connecting elements such as solid or dashed lines or arrows
  • the absence of any such connecting elements is not meant to imply that no connection, relationship, or association can exist.
  • some connections, relationships, or associations between elements are not shown in the drawings so as not to obscure the disclosure.
  • a single connecting element is used to represent multiple connections, relationships or associations between elements.
  • a connecting element represents a communication of signals, data, or instructions
  • such element represents one or multiple signal paths, as may be needed, to affect the communication.
  • the term “includes,” and its variants are to be read as open-ended terms that mean “includes but is not limited to.”
  • the term “or” is to be read as “and/or” unless the context clearly indicates otherwise.
  • the term “based on” is to be read as “based at least in part on.”
  • the term “one example implementation” and “an example implementation” are to be read as “at least one example implementation.”
  • the term “another implementation” is to be read as “at least one other implementation.”
  • the terms “determined,” “determines,” or “determining” are to be read as obtaining, receiving, computing, calculating, estimating, predicting or deriving.
  • all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skills in the art to which this disclosure belongs.
  • FIG. 1 is a block diagram of an immersive voice and audio services (IVAS) coder/decoder (“codec”) framework 100 for encoding and decoding IVAS bitstreams, according to one or more embodiments.
  • IVAS is expected to support a range of audio service capabilities, including but not limited to mono to stereo upmixing and fully immersive audio encoding, decoding and rendering.
  • IVAS is also intended to be supported by a wide range of devices, endpoints, and network nodes, including but not limited to: mobile and smart phones, electronic tablets, personal computers, conference phones, conference rooms, virtual reality (VR) and augmented reality (AR) devices, home theatre devices, and other suitable devices.
  • IVAS codec 100 includes IVAS encoder 101 and IVAS decoder 104.
  • IVAS encoder 101 includes spatial encoder 102 that receives N channels of input spatial audio (e.g., FOA, HOA).
  • spatial encoder 102 implements SPAR and DirAC for analyzing/downmixing N_dmx spatial audio channels, as described in further detail below.
  • the output of spatial encoder 102 includes a spatial metadata (MD) bitstream (BS) and N_dmx channels of spatial downmix.
  • the spatial MD is quantized and entropy coded.
  • quantization can include fine, moderate, coarse and extra coarse quantization strategies and entropy coding can include Huffman or Arithmetic coding.
  • the spatial encoder permits not more than 3 levels of quantization at a given operating mode; however, with decreasing bitrates, the three levels become increasingly coarse overall, to meet bitrate requirements.
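As a rough illustration of this kind of budget-driven choice (the strategy names below echo the fine/moderate/coarse levels mentioned above, but the bit costs and the selection rule are assumptions, not the actual IVAS tables):

```python
# Hypothetical sketch: choosing among three quantization strategies for
# spatial metadata so the coded size fits a per-frame bit budget.
# Strategy names and bit costs are illustrative, not the IVAS tables.

QUANT_STRATEGIES = [
    ("fine", 8),      # bits per metadata parameter (illustrative)
    ("moderate", 6),
    ("coarse", 4),
]

def pick_strategy(num_params: int, budget_bits: int) -> str:
    """Return the finest strategy whose estimated cost fits the budget."""
    for name, bits_per_param in QUANT_STRATEGIES:
        if num_params * bits_per_param <= budget_bits:
            return name
    # Fall back to the coarsest level if nothing fits.
    return QUANT_STRATEGIES[-1][0]

print(pick_strategy(num_params=24, budget_bits=160))  # -> "moderate"
```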
  • SCE Single Channel Element
  • IVAS decoder 104 includes core audio decoder 105 (e.g., Single Channel Element (SCE)) that decodes the audio bitstream extracted from the IVAS bitstream to recover the N_dmx audio channels.
  • core audio decoder 105 e.g., Single Channel Element (SCE)
  • Spatial decoder/renderer 106 e.g., SPAR/DirAC
  • LBRSBA e.g., Ambisonics
  • it is desirable to implement LBRSBA (e.g., Ambisonics) using a SPAR-DirAC codec.
  • LBRSBA can be achieved using one or more of the following techniques: 1) reduced MD bitrate and band interleaving; 2) extra covariance smoothing to facilitate the reduced MD bitrate; and 3) decoder side decorrelator coefficient smoothing.
  • low bitrate is achieved by operating with fewer frequency bands (e.g., 6 bands instead of 12 bands) to reduce the amount of spatial metadata that is transported from the encoder to the decoder.
  • the bottom 4 bands (4 lower frequency bands) are allocated to SPAR and the top 2 bands (2 higher frequency bands) are allocated to DirAC.
  • Table I below illustrates example band allocation when going from 12 bands to 6 bands in a particular embodiment:
  • the SPAR bands (bands 0-7) are reduced to a LBRSBA band group with four bands (bands 0-3) and DirAC bands (8-11) are reduced to a LBRSBA band group with two bands (bands 4 and 5) for a total of 6 LBRSBA bands.
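As a minimal sketch of this regrouping, assuming each LBRSBA band simply merges two adjacent original bands and that grouped metadata is averaged (the actual band edges and grouping rule follow Table I, which is not reproduced in this text):

```python
import numpy as np

# Hypothetical sketch of the 12-band to 6-band regrouping described above:
# original bands 0-7 (SPAR) collapse into LBRSBA bands 0-3, and original
# bands 8-11 (DirAC) collapse into LBRSBA bands 4-5. Here each LBRSBA band
# merges two adjacent original bands; the real mapping follows Table I.
GROUPS = [(0, 1), (2, 3), (4, 5), (6, 7),   # -> SPAR LBRSBA bands 0-3
          (8, 9), (10, 11)]                 # -> DirAC LBRSBA bands 4-5

def group_band_metadata(per_band_md: np.ndarray) -> np.ndarray:
    """Average per-band metadata (shape [12, ...]) into 6 LBRSBA bands."""
    return np.stack([per_band_md[list(g)].mean(axis=0) for g in GROUPS])

md_12 = np.random.rand(12, 4)        # e.g., 4 SPAR coefficients per band
md_6 = group_band_metadata(md_12)    # shape (6, 4)
print(md_6.shape)
```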
  • the time-resolution of the metadata is also reduced.
  • DirAC metadata is often calculated at a 5ms resolution while SPAR metadata is calculated at a 20ms resolution.
  • a slower update rate of metadata is used.
  • SPAR metadata moves to a 40ms update rate, with occasional 20ms updates, where permissible by bitrate limitations.
  • the DirAC metadata remains at 20ms resolution (compared to a DirAC baseline), or is dropped from 5ms to 20ms compared with the higher bitrates used in non-LBRSBA operation.
  • only SPAR MD is reduced to band groups.
  • more or fewer SPAR or DirAC bands can be grouped together and/or there can be more than two groups of bands.
  • in some embodiments, all SPAR and DirAC bands are sent to the decoder in an initial frame. Thereafter, a first data frame includes a first portion (e.g., a first half) of the group of SPAR metadata bands, and a second data frame includes a second portion (e.g., a second half) of the group of SPAR metadata bands, and so on.
  • the choice of which bands to send or omit in each frame is interleaved. For example, when band metadata is omitted for a frame, it is assumed to be the same metadata as that band for the previous frame. The advantage of this interleaving approach is that more frames are generated with (relatively) finely quantized metadata, at the cost of time resolution. This significantly reduces the metadata bitrate, leaving more bits for the core coder.
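To make the decode-side behavior of this interleaving concrete, here is a minimal Python sketch; the frame-type names (BASE/A/B) loosely echo the scheme names in Table II, and the assignment of bands 0-1 to A frames and bands 2-3 to B frames is an assumption for illustration, not the actual bitstream layout:

```python
# Minimal decoder-side sketch of interleaved SPAR metadata: when a band's
# metadata is omitted from a frame, the decoder holds the value from the
# last frame in which that band was sent. Frame types and band split are
# illustrative assumptions (A frames carry bands 0-1, B frames bands 2-3).

SPAR_BANDS = 4
held = [None] * SPAR_BANDS  # last received metadata per SPAR band

def receive_frame(frame_type: str, payload: dict) -> list:
    """Update held metadata from one frame and return the full band set."""
    if frame_type == "BASE":            # all bands present (e.g., initial frame)
        bands = range(SPAR_BANDS)
    elif frame_type == "A":             # first portion of the SPAR band group
        bands = (0, 1)
    else:                               # "B": second portion
        bands = (2, 3)
    for b in bands:
        held[b] = payload[b]
    return list(held)  # unsent bands keep their previous values

print(receive_frame("BASE", {0: "m0", 1: "m1", 2: "m2", 3: "m3"}))
print(receive_frame("A", {0: "m0'", 1: "m1'"}))  # bands 2-3 held from BASE
```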
  • an indication of what type of frame has been coded is achieved by reusing existing SPAR metadata bitstream signaling for time differential coding of metadata.
  • Table II below lists example coding schemes for non-LBRSBA and LBRSBA coding: Table II - Examples of Non-LBRSBA and LBRSBA Coding Schemes
  • BASE indicates entropy coding using an arithmetic coder.
  • FOUR_X indicates time-differential coding of some bands, using the original arithmetic coder and a time-differential arithmetic coder.
  • a Huffman coder is used.
  • the metadata for unsent bands is held at that band’s value from the previous frame where it was sent.
  • the best-case recovery is one frame, if a BASE or BASE_NOEC frame is used in the subsequent frame, or two frames (a successive A and B frame) otherwise.
  • Table III lists examples of IVAS SBA (Ambisonics) bitrates including LBRSBA bitrates.
  • extra covariance smoothing is applied to the LBRSBA covariance matrix to further reduce the SPAR metadata bitrate and improve single channel element (SCE) core decisions (e.g., ACELP/TCX) used to code the spatial downmix channel.
  • SCE single channel element
  • Covariance smoothing is described in US Patent Publication No. 2022/0277757 for “Systems and methods for covariance smoothing,” but the smoothing factor has been modified for LBRSBA as described below. Smoothing is applied to the covariance matrix before the MD is received or computed.
  • a frequency-domain representation of the audio is used to generate the covariance matrix which is smoothed using the covariance smoothing technique described below.
  • the SPAR and DirAC metadata are formed using the smoothed covariance matrix, grouped into LBRSBA bands as shown in Table I, quantized and sent to the decoder.

i. Smoothing Function and Forgetting Factor
  • Covariance smoothing utilizes a smoothed matrix.
  • a smoothed matrix can be calculated using a low-pass filter designed to meet particular smoothing requirements.
  • the smoothing requirements are such that previous estimates are used to artificially increase the number of frequency samples (bins) used to generate the current estimate of a covariance matrix.
  • calculating the smoothed matrix Ā from an input covariance matrix A over a frame sequence uses a first-order auto-regressive low pass filter that takes a weighted sum of past and present frames' estimated matrix values:

    Ā(n) = λ · A(n) + (1 − λ) · Ā(n − 1)    [1]

    where λ is a forgetting factor, or an update rate, i.e., how much emphasis is placed on previous estimation data, and n is the frame number.
  • Equation [1] is one example of a smoothing function that is a first order low pass filter.
  • Other smoothing functions can also be used, such as a higher order filter.
  • the important factors of the smoothing function are the looking-back aspect of using previously smoothed results and the forgetting factor that weights the influence of those results.
  • the effect of the forgetting factor is that, as smoothing is applied over successive frames, previous frames have less and less impact on the frame currently being smoothed (adjusted).
  • the equation acts as a low pass filter.
  • a lower λ places more emphasis on the old covariance data, while a higher λ takes more of the new covariance into account.
  • a forgetting factor greater than one (e.g., 1 < λ < 2) acts as a high pass filter.
  • a maximum permissible forgetting factor λ_max is implemented. This maximum value will determine the behavior of the algorithm once the bins/band values become large.
  • the forgetting factor for a particular band, λ_b, is calculated as the minimum of the maximum permitted forgetting factor λ_max and the ratio of the effective number of bins in the band N_b to the minimum number of bins N_min that are determined to give a good statistical estimate based on the window size:

    λ_b = min(λ_max, N_b / N_min)
  • N_b is the actual count of bins for the frequency band.
  • in some embodiments, λ_max = 1 such that λ_b stays within a reasonable range, e.g., 0 < λ_b ≤ 1. This means that smoothing is applied proportionally to small sample estimates, and no smoothing is applied at all to large sample estimates.
  • N_min can be selected based on the data at hand that produces the best subjective results. In some embodiments, N_min can be selected based on how much initial (first subsequent frame after the initial frame of a given window) smoothing is desired.
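A minimal Python sketch of the per-band smoothing described above, combining equation [1] with λ_b = min(λ_max, N_b / N_min); the specific values of N_MIN and LAMBDA_MAX are illustrative assumptions, not the codec's tuned constants:

```python
import numpy as np

# Sketch of per-band covariance smoothing: a first-order auto-regressive
# low-pass filter (equation [1]) whose forgetting factor is the smaller of
# lambda_max and N_b / N_min. Constants here are illustrative assumptions.
LAMBDA_MAX = 1.0   # no smoothing once a band has enough bins
N_MIN = 24         # bins judged sufficient for a good statistical estimate

def smooth_cov(cov_new: np.ndarray, cov_prev: np.ndarray, n_bins: float) -> np.ndarray:
    """Smooth one band's covariance estimate for the current frame."""
    lam = min(LAMBDA_MAX, n_bins / N_MIN)
    # A lower lam places more weight on the previous (smoothed) estimate;
    # lam == 1 passes the new estimate through unsmoothed.
    return lam * cov_new + (1.0 - lam) * cov_prev

cov_prev = np.eye(4)               # smoothed estimate from frame n-1
cov_new = 2.0 * np.eye(4)          # raw estimate for frame n
print(smooth_cov(cov_new, cov_prev, n_bins=6))   # lam = 0.25: heavy smoothing
print(smooth_cov(cov_new, cov_prev, n_bins=48))  # lam = 1.0: no smoothing
```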
  • FIG. 2 is a flow diagram of a covariance smoothing process, according to one or more embodiments.
  • An input frequency-domain signal (e.g., a Fast Fourier transform (FFT) representation) is divided into frequency bands 201.
  • FFT Fast Fourier transform
  • An effective count of the bins for that band is taken 202. This can, for example, be calculated from the filterbank response values of the band.
  • a desired bin count is determined 203, for example by a subjective analysis of how many bins would be needed to provide a good statistical analysis for the window.
  • a forgetting factor is computed 204 by taking a ratio of the calculated number of bins to the desired bin count.
  • a new covariance matrix value is computed 205 based on the new covariance value computed for the previous frame, the original value for the current frame, and the forgetting factor.
  • the new (smoothed) matrix formed by these new values is used in further signal processing 206.
  • FIG. 3 shows an example modification to process 200 for a maximum permitted forgetting factor, according to one or more embodiments.
  • a forgetting factor is computed 301 for the band.
  • a maximum permitted forgetting factor is determined 302.
  • the values are compared 303, and in response to the calculated factor being less than the maximum permitted factor, then the calculated factor is used in the smoothing 305 (hereinafter, “smoothing_factor”). If the calculated factor is greater than the maximum permitted factor, the maximum permitted factor is used 304 in the smoothing 305.
  • the smoothing factor is different depending on whether the codec is operating at non-LBRSBA or LBRSBA.
  • the smoothing can be “reset” at points where transients are detected in the signal.
  • FIG. 4 is a flow diagram of a process for modifying the transient detection process flow, according to one or more embodiments.
  • a determination is made 401 whether a transient is detected for a given frame. If it is 403, the new matrix value remains the same as the input value. If not 402, the usual smoothing algorithm is used for that frame.
  • the combination (matrix) of smoothed and non-smoothed (transient) frame values is used for signal processing 404.
  • the smoothing is reset when a transient is detected on any channel.
  • N transient detectors can be used (one per channel) and, if any of them detects a transient, the smoothing is reset; alternatively, smoothing is turned off (e.g., at the end of the signal or the end of smoothing).
  • for example, in a two-channel case, the channels may be determined to be distinct (or possibly distinct) enough that considering transients only in the left channel might leave an important transient in the right channel improperly smoothed (and vice versa). Therefore, two transient detectors are used (left and right) and either one of these can trigger a smoothing reset of the entire 2x2 matrix.
  • the smoothing is only reset on transients for certain channels. For example, if there are N channels, only M (M < N, possibly M = 1) detectors are used.
  • the first (W) channel can be determined to be the most important compared to the other three (X, Y, Z) and, given the spatial relationships between FOA signals, transients in the latter three channels will likely be reflected in the W channel anyway. Therefore, the system can be set up with a transient detector only on the W channel, triggering a reset of the entire 4x4 covariance smoothing matrix when it detects a transient on W.
  • the reset only resets covariance elements that have experienced the transient. This would mean that a transient in the nth channel would only reset values in the nth row and in the nth column of the covariance matrix (entire row and entire column). This can be performed by having separate transient monitoring on each channel; a detected transient on any given channel would trigger a reset for matrix positions that correspond to that channel's covariance with another channel (and vice versa, and, trivially, with itself).
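A minimal sketch of this selective row/column reset, assuming a smoothed and a raw (unsmoothed) covariance matrix per band and a list of channels on which transients were detected (function and variable names are illustrative):

```python
import numpy as np

# Sketch of the per-channel reset described above: a transient detected on
# channel c resets row c and column c of the smoothed covariance matrix to
# the unsmoothed (input) values, leaving other entries smoothed as usual.

def reset_rows_cols(smoothed: np.ndarray, raw: np.ndarray,
                    transient_channels: list) -> np.ndarray:
    out = smoothed.copy()
    for c in transient_channels:
        out[c, :] = raw[c, :]   # covariances of channel c with all channels
        out[:, c] = raw[:, c]
    return out

smoothed = np.full((4, 4), 0.5)
raw = np.eye(4)
print(reset_rows_cols(smoothed, raw, transient_channels=[1]))
```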
  • the reset only occurs when a majority/threshold number of channels detect a transient.
  • the threshold could be set to trigger a reset only if at least two of the channels report a transient in the same frame.
  • band-selective covariance smoothing resetting is implemented. While covariance smoothing resetting functionality helps to allow the covariance to move quickly in cases where transients occur, in cases where some bands are heavily smoothed, e.g., at lowest frequencies, rapid repeated detected transients and subsequent resetting of the covariance smoothing sometimes creates an audible stuttering effect. By selectively resetting bands with less smoothing, this effect can be minimized/avoided.
  • decoder-side decorrelator coefficient smoothing can help to prevent this effect from being audible.
  • the smoothing is mathematically equivalent to what is used for covariance smoothing described above, except without the ability to reset smoothing on transients.
  • a forgetting factor of 0.5 can be used for all bands, though other values are also possible.
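A minimal sketch of this decoder-side smoothing, using the fixed forgetting factor of 0.5 mentioned above and the same first-order recursion as the covariance smoothing; the input sequence reuses the quantized levels (0.0, 0.4, 0.8) described for FIG. 5:

```python
# Sketch of decoder-side decorrelator coefficient smoothing: the same
# first-order recursion as the covariance smoothing above, with a fixed
# forgetting factor of 0.5 and no transient reset.

LAMBDA = 0.5

def smooth_decorr(coeffs_per_frame):
    """Yield smoothed coefficients, frame by frame."""
    prev = None
    for c in coeffs_per_frame:
        prev = c if prev is None else LAMBDA * c + (1.0 - LAMBDA) * prev
        yield prev

quantized = [0.0, 0.8, 0.8, 0.4, 0.0, 0.4]  # per-frame quantized values
print([round(v, 3) for v in smooth_decorr(quantized)])
# -> [0.0, 0.4, 0.6, 0.5, 0.25, 0.325]
```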
  • FIG. 5 shows plots of smoothed and unsmoothed quantized decorrelation coefficients over n frames, according to one or more embodiments.
  • a first plot 501 shows an example of unsmoothed quantized decorrelation coefficients at the decoder with three possible levels (0.0, 0.4, 0.8).
  • a second plot 502 shows an example of smoothed quantized decorrelator coefficients.
  • FIG. 6 is a flow diagram of LBRSBA Ambisonics processing, according to one or more embodiments.
  • Process 600 can be implemented using the electronic device architecture described in reference to FIG. 7.
  • process 600 includes: receiving scene-based audio metadata (601); creating, from the scene-based audio metadata, Spatial Reconstruction (SPAR) metadata and Directional Audio Coding (DirAC) metadata (602); forming a group of SPAR metadata bands and a group of DirAC metadata bands (603); quantizing the group of SPAR metadata bands and the group of DirAC metadata bands (604); sending to a decoder: a first data frame including the quantized group of DirAC metadata bands and a first portion of the quantized group of SPAR metadata bands, and a second data frame following the first data frame, the second data frame including the quantized DirAC metadata bands and a second portion of the quantized group of SPAR metadata bands (605).
  • SPAR Spatial Reconstruction
  • DirAC Directional Audio Coding
  • FIG. 7 shows a block diagram of an example electronic device architecture 700 suitable for implementing example embodiments of the present disclosure.
  • Architecture 700 includes but is not limited to servers and client devices, as previously described in reference to FIGS. 1-6.
  • the architecture 700 includes central processing unit (CPU) 701 which is capable of performing various processes in accordance with a program stored in, for example, read only memory (ROM) 702 or a program loaded from, for example, storage unit 708 to random access memory (RAM) 703.
  • ROM read only memory
  • RAM random access memory
  • In RAM 703, the data required when CPU 701 performs the various processes is also stored, as required.
  • CPU 701, ROM 702 and RAM 703 are connected to one another via bus 704.
  • Input/output (I/O) interface 705 is also connected to bus 704.
  • The following components are connected to I/O interface 705: input unit 706, which may include a keyboard, a mouse, or the like; output unit 707, which may include a display such as a liquid crystal display (LCD) and one or more speakers; storage unit 708, including a hard disk or another suitable storage device; and communication unit 709, including a network interface card such as a network card (e.g., wired or wireless).
  • input unit 706 includes one or more microphones in different positions (depending on the host device) enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).
  • various formats e.g., mono, stereo, spatial, immersive, and other suitable formats.
  • output unit 707 can include systems with various numbers of speakers.
  • Output unit 707 (depending on the capabilities of the host device) can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).
  • communication unit 709 is configured to communicate with other devices (e.g., via a network).
  • Drive 710 is also connected to I/O interface 705, as required.
  • Removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium, is mounted on drive 710, so that a computer program read therefrom is installed into storage unit 708, as required.
  • the processes described above may be implemented as computer software programs or on a computer-readable storage medium.
  • embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods.
  • the computer program may be downloaded and mounted from the network via the communication unit 709, and/or installed from the removable medium 711, as shown in FIG. 7.
  • Control circuitry (e.g., CPU 701 in combination with other components of FIG. 7) may perform the actions described in this disclosure.
  • Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device (e.g., control circuitry).
  • a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may be non-transitory and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • RAM random-access memory
  • ROM read-only memory
  • EPROM or Flash memory erasable programmable read-only memory
  • CD-ROM portable compact disc read-only memory
  • magnetic storage device or any suitable combination of the foregoing.
  • Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and/or servers.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

According to embodiments, the present disclosure relates to very low bitrate scene-based audio coding (LBRSBA) with combined SPAR and DirAC. In some embodiments, a method comprises: receiving scene-based audio metadata; creating, from the scene-based audio metadata, Spatial Reconstruction (SPAR) metadata and Directional Audio Coding (DirAC) metadata; forming a group of SPAR metadata bands and a group of DirAC metadata bands; quantizing the group of SPAR metadata bands and the group of DirAC metadata bands; and sending to a decoder: a first data frame including the quantized group of DirAC metadata bands and a first portion of the quantized group of SPAR metadata bands, and a second data frame following the first data frame, the second data frame including the quantized DirAC metadata bands and a second portion of the quantized group of SPAR metadata bands.
PCT/US2023/075621 2022-10-31 2023-09-29 Low bitrate scene-based audio coding WO2024097485A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263421045P 2022-10-31 2022-10-31
US63/421,045 2022-10-31
US202363582950P 2023-09-15 2023-09-15
US63/582,950 2023-09-15

Publications (1)

Publication Number Publication Date
WO2024097485A1 true WO2024097485A1 (fr) 2024-05-10

Family

ID=88731504

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/075621 WO2024097485A1 (fr) 2022-10-31 2023-09-29 Low bitrate scene-based audio coding

Country Status (1)

Country Link
WO (1) WO2024097485A1 (fr)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9978385B2 (en) 2013-10-21 2018-05-22 Dolby International Ab Parametric reconstruction of audio signals
WO2021022087A1 (fr) * 2019-08-01 2021-02-04 Dolby Laboratories Licensing Corporation Codage et décodage de flux binaires ivas
US20220277757A1 (en) 2019-08-01 2022-09-01 Dolby Laboratories Licensing Corporation Systems and methods for covariance smoothing
US20220406318A1 (en) 2019-10-30 2022-12-22 Dolby Laboratories Licensing Corporation Bitrate distribution in immersive voice and audio services
WO2021130404A1 (fr) * 2019-12-23 2021-07-01 Nokia Technologies Oy Fusion de paramètres audio spatiaux
WO2021252811A2 (fr) 2020-06-11 2021-12-16 Dolby Laboratories Licensing Corporation Quantification et codage entropique de paramètres pour un codec audio à faible latence
WO2021252748A1 (fr) 2020-06-11 2021-12-16 Dolby Laboratories Licensing Corporation Codage de signaux audio multicanaux comprenant le sous-mixage d'un canal d'entrée primaire et d'au moins deux canaux d'entrée non primaires mis à l'échelle
WO2022120093A1 (fr) 2020-12-02 2022-06-09 Dolby Laboratories Licensing Corporation Services vocaux et audio immersifs (ivas) avec stratégies de mélange abaisseur adaptatives
WO2023063769A1 (fr) 2021-10-15 2023-04-20 (주)지노믹트리 Construction génique pour l'expression d'arnm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LASSE LAAKSONEN ET AL: "DRAFT TS 26.253 (Codec for Immersive Voice and Audio Services, Detailed Algorithmic Description incl. RTP payload format and SDP parameter definitions)", vol. 3GPP SA 4, no. Chicago, US; 20231113 - 20231117, 7 November 2023 (2023-11-07), XP052546126, Retrieved from the Internet <URL:https://www.3gpp.org/ftp/TSG_SA/WG4_CODEC/TSGS4_126_Chicago/Docs/S4-231842.zip S4-231842 draft_TS26.253_v001.docx> [retrieved on 20231107] *

Similar Documents

Publication Publication Date Title
TWI752281B (zh) Apparatus and method for encoding or decoding directional audio coding parameters using quantization and entropy coding
KR101790641B1 (ko) Hybrid waveform-coded and parametric-coded speech enhancement
US20220406318A1 (en) Bitrate distribution in immersive voice and audio services
KR102492119B1 (ko) Method for determining audio coding/decoding mode and related products
EP4008000A1 (fr) Encoding and decoding of IVAS bitstreams
WO2019170955A1 (fr) Audio coding
EP3881560A1 (fr) Representation of spatial audio by means of an audio signal and associated metadata
US20240153512A1 (en) Audio codec with adaptive gain control of downmixed signals
WO2024097485A1 (fr) Low bitrate scene-based audio coding
US20220293112A1 (en) Low-latency, low-frequency effects codec
US10916255B2 (en) Apparatuses and methods for encoding and decoding a multichannel audio signal
KR20230023760A (ko) Encoding of multi-channel audio signals including downmixing of a primary and two or more scaled non-primary input channels
TW202347317A (zh) Method, apparatus and system for directional audio coding spatial reconstruction audio processing
CN116982109A (zh) Audio codec with adaptive gain control of downmixed signals
TW202411984A (zh) Encoder and encoding method for discontinuous transmission of parametrically coded independent streams with metadata
TW202410024A (zh) Methods, systems, and non-transitory computer-readable media for encoding and decoding immersive voice and audio services bitstreams
EP4256557A1 (fr) Spatial noise filling in a multi-channel codec