CN116529815A - Apparatus and method for encoding a plurality of audio objects and apparatus and method for decoding using two or more relevant audio objects - Google Patents


Info

Publication number
CN116529815A
Authority
CN
China
Prior art keywords
audio
information
audio objects
objects
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180076553.3A
Other languages
Chinese (zh)
Inventor
Andrea Eichenseer
Srikanth Korse
Stefan Bayer
Fabian Küch
Oliver Thiergart
Guillaume Fuchs
Dominik Weckbecker
Jürgen Herre
Markus Multrus
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Original Assignee
Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Priority claimed from PCT/EP2021/078217 (WO2022079049A2)
Publication of CN116529815A


Landscapes

  • Stereophonic System (AREA)

Abstract

An apparatus for encoding a plurality of audio objects comprises: an object parameter calculator (100) configured to calculate parameter data for at least two relevant audio objects for one or more of a plurality of frequency bins related to a time frame, wherein the number of the at least two relevant audio objects is lower than the total number of the plurality of audio objects; and an output interface (200) configured to output an encoded audio signal comprising information on the parameter data for the at least two relevant audio objects for the one or more frequency bins.

Description

Apparatus and method for encoding a plurality of audio objects and apparatus and method for decoding using two or more relevant audio objects

Technical Field

The present invention relates to the encoding of audio signals (e.g., audio objects) and to the decoding of encoded audio signals (e.g., encoded audio objects).

Background Art

Introduction

This document describes a parametric approach for encoding and decoding object-based audio content at low bit rates using Directional Audio Coding (DirAC). The embodiments presented are intended for use as part of the 3GPP Immersive Voice and Audio Services (IVAS) codec, where they provide an advantageous alternative at low bit rates to the Independent Streams with Metadata (ISM) mode, a discrete coding approach.

Prior Art

Discrete coding of objects

The most straightforward way of coding object-based audio content is to code the objects individually and to transmit them together with the corresponding metadata. The main drawback of this approach is that the bit consumption needed for coding the objects becomes excessively high as the number of objects grows. A simple solution to this problem is to employ a "parametric approach", where a set of relevant parameters is computed from the input signals, quantized, and transmitted together with a suitable downmix signal that combines several object waveforms.

Spatial Audio Object Coding (SAOC)

Spatial Audio Object Coding [SAOC_STD, SAOC_AES] is a parametric approach in which the encoder computes a downmix signal based on a certain downmix matrix D together with a set of parameters and transmits both to the decoder. The parameters represent psychoacoustically relevant properties and relationships of all the individual objects. At the decoder, the downmix is rendered to a specific loudspeaker layout using a rendering matrix R.

The main parameter of SAOC is the object covariance matrix E of size N × N, where N denotes the number of objects. This parameter is conveyed to the decoder in the form of object level differences (OLD) and, optionally, inter-object covariances (IOC).

The elements e_{i,j} of the matrix E are given by:

$$e_{i,j} = \sqrt{OLD_i \, OLD_j}\; IOC_{i,j}$$

The object level differences (OLD) are defined as

$$OLD_i = \frac{nrg_{i,i}}{\max_j\left(nrg_{j,j}\right)}$$

where

$$nrg_{i,j} = \sum_{n \in l}\,\sum_{k \in m} x_i(n,k)\, x_j^{*}(n,k)$$

and the absolute object energy (NRG) is described as

$$NRG = \max_i\left(nrg_{i,i}\right) + \varepsilon$$

Here, i and j are the object indices of the objects x_i and x_j, n indicates a time index, and k indicates a frequency index; l indicates a set of time indices and m a set of frequency indices. ε is an additive constant to avoid division by zero, e.g., ε = 10.

The similarity measure of the input objects (IOC) can, for example, be given by the cross-correlation:

$$IOC_{i,j} = \mathrm{Re}\!\left\{\frac{nrg_{i,j}}{\sqrt{nrg_{i,i}\, nrg_{j,j}}}\right\}$$
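As an illustration of how the OLD and IOC parameters above can be computed from the time/frequency samples of the objects, the following Python sketch evaluates both quantities for one parameter tile. The function name and the tile layout are assumptions for illustration, not part of the SAOC standard:

```python
import numpy as np

def saoc_params(X, eps=1e-9):
    """Compute OLD and IOC for one parameter tile.

    X: complex array of shape (N_objects, N_slots, N_bins) holding the
    time/frequency samples x_i(n, k) of each object within the tile
    (the index sets l and m from the text). Hypothetical helper, not
    the normative implementation.
    """
    N = X.shape[0]
    # nrg[i, j] = sum over n and k of x_i(n, k) * conj(x_j(n, k))
    Xf = X.reshape(N, -1)
    nrg = Xf @ Xf.conj().T
    # OLD_i: object energy normalized by the maximum object energy
    diag = np.real(np.diag(nrg))
    old = diag / (diag.max() + eps)
    # IOC_{i,j}: normalized cross-correlation between objects
    denom = np.sqrt(np.outer(diag, diag)) + eps
    ioc = np.real(nrg) / denom
    return old, ioc
```

With two objects that are scaled copies of each other, the louder one receives OLD = 1, the quieter one the squared gain ratio, and their IOC is 1.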

The downmix matrix D of size N_dmx × N is defined by the elements d_{i,j}, where i refers to the channel index of the downmix signal and j refers to the object index. For a stereo downmix (N_dmx = 2), d_{i,j} is calculated from the parameters DMG and DCLD as

$$d_{1,j} = 10^{\frac{DMG_j}{20}}\sqrt{\frac{10^{\frac{DCLD_j}{10}}}{1+10^{\frac{DCLD_j}{10}}}}, \qquad d_{2,j} = 10^{\frac{DMG_j}{20}}\sqrt{\frac{1}{1+10^{\frac{DCLD_j}{10}}}}$$

where DMG_j and DCLD_j are given by:

$$DMG_j = 10\log_{10}\!\left(d_{1,j}^2 + d_{2,j}^2 + \varepsilon\right), \qquad DCLD_j = 10\log_{10}\!\left(\frac{d_{1,j}^2 + \varepsilon}{d_{2,j}^2 + \varepsilon}\right)$$

For the mono downmix case (N_dmx = 1), d_{i,j} is calculated from the DMG parameters only as

$$d_{1,j} = 10^{\frac{DMG_j}{20}}$$

where

$$DMG_j = 10\log_{10}\!\left(d_{1,j}^2 + \varepsilon\right)$$
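The reconstruction of the stereo downmix matrix from transmitted DMG/DCLD parameters can be sketched in a few lines; this is a minimal illustration of the stereo equations above under the stated conventions, not the normative SAOC code:

```python
import numpy as np

def stereo_downmix_matrix(dmg, dcld):
    """Rebuild the 2 x N downmix matrix D from DMG/DCLD parameters
    (both in dB). Illustrative sketch; name and layout are assumed."""
    dmg = np.asarray(dmg, dtype=float)
    dcld = np.asarray(dcld, dtype=float)
    g = 10.0 ** (dmg / 20.0)           # overall gain per object
    r = 10.0 ** (dcld / 10.0)          # channel power ratio per object
    d1 = g * np.sqrt(r / (1.0 + r))    # first-channel gains
    d2 = g * np.sqrt(1.0 / (1.0 + r))  # second-channel gains
    return np.vstack([d1, d2])
```

For DMG = 0 dB and DCLD = 0 dB, an object is split equally between both channels with gain 1/sqrt(2) each, so the total power is preserved.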

Spatial Audio Object Coding 3D (SAOC-3D)

Spatial Audio Object Coding for 3D audio reproduction (SAOC-3D) [MPEGH_AES, MPEGH_IEEE, MPEGH_STD, SAOC_3D_PAT] is an extension of the MPEG SAOC technology described above; it compresses and renders both channel and object signals in a very bit-rate-efficient way.

The main differences from SAOC are:

· While the original SAOC supports only up to two downmix channels, SAOC-3D can map a multi-object input to an arbitrary number of downmix channels (plus associated side information).

· Rendering is performed directly to the multi-channel output, whereas classic SAOC used MPEG Surround as a multi-channel output processor.

· Some tools, such as the residual coding tool, were dropped.

Despite these differences, SAOC-3D is identical to SAOC from a parameter perspective. The SAOC-3D decoder, just like the SAOC decoder, receives the multi-channel downmix X, the covariance matrix E, the rendering matrix R, and the downmix matrix D.

The rendering matrix R is defined by the input channels and the input objects; it is received from the format converter (for channels) and from the object renderer (for objects), respectively.

The downmix matrix D is defined by the elements d_{i,j}, where i refers to the channel index of the downmix signal and j refers to the object index; it is calculated from the downmix gains (DMG):

$$d_{i,j} = 10^{\frac{DMG_{i,j}}{20}}$$

where

$$DMG_{i,j} = 10\log_{10}\!\left(d_{i,j}^2 + \varepsilon\right)$$

The output covariance matrix C of size N_out × N_out is defined as:

$$C = R\,E\,R^{*}$$
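The relation C = R E R* (with * denoting the conjugate transpose) is a plain matrix product and can be illustrated directly; the shapes and names below are assumptions for illustration:

```python
import numpy as np

def output_covariance(R, E):
    """Target output covariance C = R E R*, as used by the SAOC-3D
    decoder. R: (N_out x N) rendering matrix, E: (N x N) object
    covariance. Illustrative sketch only."""
    return R @ E @ R.conj().T
```

For instance, with two uncorrelated unit-power objects (E = I) rendered to three channels, the third channel receiving both objects ends up with power 2, i.e., the sum of both object powers.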

Related schemes

There are several other schemes that are similar in nature to SAOC but differ in the following details:

· Binaural Cue Coding (BCC) for objects, described, e.g., in [BCC2001], is a predecessor of the SAOC technique.

· Joint Object Coding (JOC) and Advanced Joint Object Coding (A-JOC) perform a function similar to SAOC while providing approximately separated objects at the decoder side without rendering them to a specific output loudspeaker layout [JOC_AES, AC4_AES]. This technique transmits the elements of the upmix matrix from the downmix to the separated objects as parameters (instead of OLDs).

Directional Audio Coding (DirAC)

Another parametric approach is Directional Audio Coding. DirAC [Pulkki2009] is a perceptually motivated reproduction of spatial sound. It assumes that, at one time instant and for one critical band, the spatial resolution of the human auditory system is limited to decoding one cue for direction and another cue for interaural coherence.

Based on these assumptions, DirAC represents the spatial sound in one frequency band by cross-fading two streams: a non-directional diffuse stream and a directional non-diffuse stream. DirAC processing is performed in two phases, analysis and synthesis, as shown in Figs. 12a and 12b.

In the DirAC analysis phase, a first-order coincident microphone in B-format is taken as input, and the diffuseness and the direction of arrival of the sound are analyzed in the frequency domain.

In the DirAC synthesis phase, the sound is divided into two streams, the non-diffuse stream and the diffuse stream. The non-diffuse stream is reproduced as point sources using amplitude panning, which can be done with Vector Base Amplitude Panning (VBAP) [Pulkki1997]. The diffuse stream is responsible for the sensation of envelopment and is produced by conveying mutually decorrelated signals to the loudspeakers.

The analysis stage in Fig. 12a comprises a band filter 1000, an energy estimator 1001, an intensity estimator 1002, temporal averaging elements 999a and 999b, a diffuseness calculator 1003, and a direction calculator 1004. The calculated spatial parameters are a diffuseness value between 0 and 1 for each time/frequency bin and a direction-of-arrival parameter for each time/frequency bin generated by block 1004. In Fig. 12a, the direction parameter comprises an azimuth angle and an elevation angle indicating the direction of arrival of the sound relative to a reference or listening position, and in particular relative to the position of the microphone from which the four component signals input into the band filter 1000 are collected. In Fig. 12a, these component signals are first-order Ambisonics components comprising an omnidirectional component W, a directional component X, a further directional component Y, and yet another directional component Z.
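The analysis described above, an intensity-based direction-of-arrival estimate plus a diffuseness value per time/frequency bin, can be sketched as follows. Normalization conventions vary between DirAC implementations, so this is an illustrative approximation rather than the exact processing of blocks 1000 to 1004:

```python
import numpy as np

def dirac_analysis(W, X, Y, Z):
    """Per-tile DirAC analysis from B-format spectra.

    W, X, Y, Z: complex arrays of shape (time slots, frequency bins).
    Returns azimuth, elevation (radians) and diffuseness in [0, 1]
    per frequency bin, averaging over the time slots of the tile.
    Simplified sketch; scaling conventions are assumptions."""
    # active sound intensity per bin, time-averaged
    Ix = np.mean(np.real(np.conj(W) * X), axis=0)
    Iy = np.mean(np.real(np.conj(W) * Y), axis=0)
    Iz = np.mean(np.real(np.conj(W) * Z), axis=0)
    # overall sound-field energy per bin
    E = np.mean(np.abs(W) ** 2 + (np.abs(X) ** 2 + np.abs(Y) ** 2
                                  + np.abs(Z) ** 2) / 3.0, axis=0) / 2.0
    norm_I = np.sqrt(Ix ** 2 + Iy ** 2 + Iz ** 2)
    # direction of arrival from the intensity vector
    azimuth = np.arctan2(Iy, Ix)
    elevation = np.arctan2(Iz, np.sqrt(Ix ** 2 + Iy ** 2))
    # diffuseness: 1 minus the ratio of directed to total energy flow
    diffuseness = 1.0 - norm_I / (E + 1e-12)
    return azimuth, elevation, np.clip(diffuseness, 0.0, 1.0)
```

A single plane wave arriving from the front (energy only in W and X) yields azimuth 0, elevation 0, and a diffuseness of 0 after clipping.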

The DirAC synthesis stage shown in Fig. 12b comprises a band filter 1005 for generating the time/frequency representation of the B-format microphone signals W, X, Y, Z. The corresponding signals for the individual time/frequency bins are input into a virtual microphone stage 1006, which generates a virtual microphone signal for each channel. Specifically, in order to generate the virtual microphone signal for, e.g., the center channel, a virtual microphone is directed into the direction of the center channel, and the resulting signal is the corresponding component signal for the center channel. The signal is then processed via a direct signal branch 1015 and a diffuse signal branch 1014. Both branches comprise corresponding gain adjusters or amplifiers that are controlled by diffuseness values derived from the original diffuseness parameter in blocks 1007, 1008 and are further processed in blocks 1009, 1010 in order to obtain a certain microphone compensation.

The component signals in the direct signal branch 1015 are also gain-adjusted using gain parameters derived from the direction parameter consisting of an azimuth angle and an elevation angle. Specifically, these angles are input into a VBAP (vector base amplitude panning) gain table 1011. For each channel, the result is input into a loudspeaker gain averaging stage 1012 and a further normalizer 1013, and the resulting gain parameters are then forwarded to the amplifier or gain adjuster in the direct signal branch 1015. The diffuse signal generated at the output of a decorrelator 1016 and the direct signal or non-diffuse stream are combined in a combiner 1017, and the different sub-bands are then added up in a further combiner 1018, which can, for example, be a synthesis filter bank. Thus, a loudspeaker signal for a certain loudspeaker is generated, and the same procedure is performed for the other channels for the other loudspeakers 1019 in a certain loudspeaker setup.

The high-quality version of the DirAC synthesis is shown in Fig. 12b, where the synthesizer receives all B-format signals, from which a virtual microphone signal is computed for each loudspeaker direction. The directional pattern utilized is typically a dipole. The virtual microphone signals are then modified in a non-linear fashion depending on the metadata, as discussed with respect to branches 1016 and 1015. The low-bit-rate version of DirAC is not shown in Fig. 12b; in that version, however, only a single channel of audio is transmitted. The difference in processing is that all virtual microphone signals are replaced by this single received channel of audio. The virtual microphone signals are divided into two streams, the diffuse and the non-diffuse streams, which are processed separately. The non-diffuse sound is reproduced as point sources by using vector base amplitude panning (VBAP). In panning, a monophonic sound signal is applied to a subset of loudspeakers after multiplication with loudspeaker-specific gain factors. The gain factors are computed using the information of the loudspeaker setup and the specified panning direction. In the low-bit-rate version, the input signal is simply panned to the directions implied by the metadata. In the high-quality version, each virtual microphone signal is multiplied with the corresponding gain factor, which produces the same effect as panning but is less prone to non-linear artifacts.

The aim of the synthesis of the diffuse sound is to create a perception of sound that surrounds the listener. In the low-bit-rate version, the diffuse stream is reproduced by decorrelating the input signal and reproducing it from every loudspeaker. In the high-quality version, the virtual microphone signals of the diffuse stream are already incoherent to some degree, and they need to be decorrelated only mildly.

The DirAC parameters, also called spatial metadata, consist of tuples of diffuseness and direction, the latter represented in spherical coordinates by two angles, azimuth and elevation. If both the analysis and the synthesis stage run on the decoder side, the time/frequency resolution of the DirAC parameters can be chosen to be the same as that of the filter bank used for the DirAC analysis and synthesis, i.e., a distinct parameter set for each time slot and frequency bin of the filter-bank representation of the audio signal.

Some work has been done to reduce the size of the metadata so that the DirAC paradigm can also be used in spatial audio coding and teleconferencing scenarios [Hirvonen2009].

In [WO2019068638], a universal DirAC-based spatial audio coding system is presented. In contrast to classic DirAC, which was designed for B-format (first-order Ambisonics) input, this system can accept first-order or higher-order Ambisonics, multi-channel, or object-based audio input, and also allows input signals of mixed types. All signal types are coded and transmitted efficiently, either individually or in combination. The former combines the different representations at the renderer (decoder side), whereas the latter uses an encoder-side combination of the different audio representations in the DirAC domain.

Compatibility with the DirAC framework

The present embodiments build on the unified framework for arbitrary input types proposed in [WO2019068638] and, similarly to what [WO2020249815] did for multi-channel content, aim to eliminate the problem that the DirAC parameters (direction and diffuseness) cannot be applied efficiently to object input. In fact, the diffuseness parameter is not needed at all, but it was found that a single direction cue per time/frequency unit is not sufficient to reproduce object content at high quality. The embodiments therefore propose to employ multiple direction cues per time/frequency unit and accordingly introduce an adapted parameter set that replaces the classic DirAC parameters in the case of object input.

A flexible system for low bit rates

In contrast to DirAC, which uses a scene-based representation from the listener's point of view, SAOC and SAOC-3D were designed for channel- and object-based content, where the parameters describe the relationships between the channels/objects. In order to use a scene-based representation for object input, and thus to be compatible with a DirAC renderer, while ensuring an efficient representation and high-quality reproduction, an adapted parameter set is needed that also allows several direction cues to be signaled.

An important goal of the embodiments is to find a way of encoding object input efficiently at low bit rates, with good scalability for an increasing number of objects. Discrete coding of each object signal does not offer this scalability: each additional object causes a significant increase of the overall bit rate. If an increasing number of objects exceeds the allowed bit rate, this directly leads to a very audible degradation of the output signal; this degradation is a further argument in favor of the present embodiments.

It is an object of the present invention to provide improved concepts for encoding a plurality of audio objects or for decoding an encoded audio signal.

This object is achieved by an apparatus for encoding according to claim 1, a decoder according to claim 18, an encoding method according to claim 28, a decoding method according to claim 29, a computer program according to claim 30, or an encoded audio signal according to claim 31.

In one aspect, the invention is based on the finding that, for one or more of a plurality of frequency bins, at least two relevant audio objects are determined, and parameter data related to these at least two relevant objects is included on the encoder side and used on the decoder side in order to obtain a high-quality and efficient audio encoding/decoding concept.

According to another aspect, the invention is based on the finding that a specific downmix adapted to the direction information associated with each object is performed, so that each object, with its associated direction information that is valid for the whole object (i.e., for all frequency bins in a time frame), is downmixed into the plurality of transport channels using this direction information. The use of the direction information is, for example, equivalent to generating the transport channels as virtual microphone signals with certain adjustable characteristics.

On the decoder side, a specific synthesis relying on covariance synthesis is performed, which in certain embodiments is particularly suited for a high-quality covariance synthesis that does not suffer from artifacts introduced by decorrelators. In other embodiments, an advanced covariance synthesis is used that relies on specific improvements over the standard covariance synthesis, in order to enhance the audio quality and/or to reduce the amount of computation needed for calculating the mixing matrix used within the covariance synthesis.

However, even in a more classical synthesis, in which the audio rendering is performed by explicitly determining the individual contributions within a time/frequency bin based on transmitted selection information, the audio quality is superior compared to prior-art object coding or channel downmix procedures. In this case, object identification information is available for each time/frequency bin, and when the audio rendering is performed, i.e., when the direction contribution of each object is taken into account, this object identification is used to look up the direction associated with the object information in order to determine gain values for the individual output channels for each time/frequency bin. Hence, when only a single relevant object exists in a time/frequency bin, gain values for this single object only are determined for each time/frequency bin, based on a "codebook" of object IDs and the associated object direction information.

When, however, more than one relevant object exists in a time/frequency bin, gain values are calculated for each relevant object, so that the corresponding time/frequency bin of the transport channels is distributed into the corresponding output channels as controlled by the output format provided by the user (e.g., a certain channel format such as a stereo format, a 5.1 format, etc.). Irrespective of whether the gain values are used for the purpose of covariance synthesis (i.e., for applying a mixing matrix that mixes the transport channels into the output channels), or whether the gain values are used to explicitly determine the individual contribution of each object in a time/frequency bin (possibly enhanced by the addition of a diffuse signal component) by multiplying the gain values with the corresponding time/frequency bin of one or more transport channels and then summing the contributions for each output channel in the corresponding time/frequency bin, the output audio quality is nevertheless enhanced by the flexibility provided by determining one or more relevant objects per frequency bin.
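The per-bin rendering described above, in which transmitted object IDs select directions from the per-frame direction "codebook" and per-object gains distribute a transport-channel tile to the output channels, can be sketched as follows. All names, the mono transport simplification, and the power-share weighting are assumptions for illustration, not the normative processing:

```python
import numpy as np

def render_tile(dmx_tile, object_ids, powers, directions, pan_gains):
    """Render one time/frequency tile of the downmix to output channels.

    dmx_tile:    complex sample of the transport channel for this tile
                 (collapsed to a single mono transport value for brevity)
    object_ids:  IDs of the relevant objects selected for this tile
    powers:      relative power share of each relevant object (sum to 1)
    directions:  dict id -> (azimuth, elevation), the per-frame direction
                 "codebook" transmitted once for all bins of the frame
    pan_gains:   callable (az, el) -> gain vector over output channels,
                 e.g. a VBAP panner; hypothetical helper."""
    out = None
    for obj, share in zip(object_ids, powers):
        g = pan_gains(*directions[obj])          # channel gains for object
        contrib = np.sqrt(share) * dmx_tile * g  # scale tile by its share
        out = contrib if out is None else out + contrib
    return out
```

Summing these per-bin outputs over all bins and feeding them through a synthesis filter bank would then correspond to the explicit-contribution rendering path described in the text.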

This determination can be performed very efficiently, since only the one or more object IDs per time/frequency bin have to be encoded and transmitted to the decoder together with the direction information for each object, which in turn is also very compact: per frame, only a single item of direction information exists for all frequency bins of an object.

Hence, irrespective of whether the synthesis is done using the preferably enhanced covariance synthesis or using a combination of explicit transport-channel contributions for each object, an efficient and high-quality object downmix is obtained, which is preferably enhanced by using a specific object-direction-dependent downmix relying on downmix weights that reflect the generation of the transport channels as virtual microphone signals.

The aspects related to two or more relevant objects per time/frequency bin can preferably be combined with the aspects of performing a specific direction-dependent downmix of the objects into the transport channels. However, both aspects can also be applied independently of each other. Furthermore, although a covariance synthesis with two or more relevant objects per time/frequency bin is performed in some embodiments, the advanced covariance synthesis and the advanced transport-channel-to-output-channel upmix can also be performed with only a single object identification transmitted per time/frequency bin.

Moreover, irrespective of whether a single or several relevant objects per time/frequency bin exist, the upmixing can be done either by calculating a mixing matrix as in a standard or enhanced covariance synthesis, or by individually determining the contributions of a time/frequency bin, based on the object identification used to retrieve the specific direction information from the direction "codebook", in order to determine the gain values for the corresponding contributions. These contributions are then summed up in order to obtain the full contribution per time/frequency bin in the case of two or more relevant objects per time/frequency bin. The output of this summing step is then equivalent to the output of a mixing-matrix application, and a final filter-bank processing is performed in order to generate the time-domain output channel signals in the corresponding output format.

Brief Description of the Drawings

Preferred embodiments of the present invention are described below with reference to the accompanying drawings, in which:

Fig. 1a shows an implementation of an audio encoder according to the first aspect, with at least two relevant objects per time/frequency bin;

Fig. 1b shows an implementation of an encoder according to the second aspect, with a direction-dependent object downmix;

Fig. 2 shows a preferred implementation of an encoder according to the second aspect;

Fig. 3 shows a preferred implementation of an encoder according to the first aspect;

Fig. 4 shows a preferred implementation of a decoder according to the first and second aspects;

Fig. 5 shows a preferred implementation of the covariance synthesis processing of Fig. 4;

Fig. 6a shows an implementation of a decoder according to the first aspect;

Fig. 6b shows a decoder according to the second aspect;

Fig. 7a shows a flowchart illustrating the determination of the parameter information according to the first aspect;

Fig. 7b shows a preferred implementation of the further determination of the parametric data;

Fig. 8a shows a high-resolution filter-bank time/frequency representation;

Fig. 8b shows the transmission of the relevant side information for a frame J according to preferred implementations of the first and second aspects;

Fig. 8c shows the "direction codebook" included in the encoded audio signal;

Fig. 9a shows a preferred encoding scheme according to the second aspect;

Fig. 9b shows an implementation of a static downmix according to the second aspect;

Fig. 9c shows an implementation of a dynamic downmix according to the second aspect;

Fig. 9d shows a further embodiment of the second aspect;

Fig. 10a shows a flowchart of a preferred implementation of the decoder side of the first aspect;

Fig. 10b shows a preferred implementation of the output channel calculation of Fig. 10a according to an embodiment in which the contributions to each output channel are summed;

Fig. 10c shows a preferred way of determining the power values for several relevant objects according to the first aspect;

Fig. 10d shows an embodiment of calculating the output channels of Fig. 10a using a covariance synthesis relying on the calculation and application of a mixing matrix;

Fig. 11 shows several embodiments for the advanced calculation of the mixing matrix for a time/frequency bin;

Fig. 12a shows a prior-art DirAC encoder; and

Fig. 12b shows a prior-art DirAC decoder.

具体实施方式DETAILED DESCRIPTION

图1a示出了用于对多个音频对象进行编码的装置，该装置在输入处接收原样的音频对象和/或音频对象的元数据。编码器包括对象参数计算器100，其针对时间/频率区间提供至少两个相关音频对象的参数数据，并且将该数据转发给输出接口200。具体地，对象参数计算器针对与时间帧相关的多个频率区间中的一个或多个频率区间计算至少两个相关音频对象的参数数据，其中，具体地，至少两个相关音频对象的数量低于多个音频对象的总数。因此，对象参数计算器100实际上执行选择，而不仅指示所有对象是相关的。在优选实施例中，该选择是通过相关性来进行的，并且相关性是通过幅度相关测量值（例如，幅度、功率、响度、或通过将幅度提高到不同于1并且优选地大于1的功率而获得的另一种测量值）来确定的。然后，如果一定数量的相关对象可用于时间/频率区间，则选择具有最相关特性的对象（即，在所有对象中具有最高功率的对象），并且将关于这些所选对象的数据包括在参数数据中。Fig. 1a shows an apparatus for encoding a plurality of audio objects, which receives the audio objects as such and/or metadata of the audio objects at its input. The encoder comprises an object parameter calculator 100, which provides parameter data of at least two related audio objects for a time/frequency bin and forwards this data to an output interface 200. Specifically, the object parameter calculator calculates parameter data of at least two related audio objects for one or more of a plurality of frequency bins related to a time frame, wherein, in particular, the number of the at least two related audio objects is lower than the total number of the plurality of audio objects. The object parameter calculator 100 therefore actually performs a selection, rather than merely indicating that all objects are relevant. In a preferred embodiment, the selection is made on the basis of relevance, and the relevance is determined by an amplitude-related measure (e.g., the amplitude, the power, the loudness, or another measure obtained by raising the amplitude to an exponent different from 1 and preferably greater than 1). Then, if a certain number of relevant objects is available for a time/frequency bin, the objects with the strongest relevance (i.e., the objects with the highest power among all objects) are selected, and data about these selected objects is included in the parameter data.

输出接口200被配置为输出编码音频信号,该编码音频信号包括关于一个或多个频率区间的至少两个相关音频对象的参数数据的信息。取决于实现,输出接口可以接收其他数据(例如,对象下混、或表示对象下混的一个或多个传输通道、或附加参数、或若干个对象被下混的混合表示形式的对象波形数据、或单独表示形式的其他对象)并将其输入到编码音频信号中。在这种情况下,将对象直接引入或“复制”到对应的传输通道中。The output interface 200 is configured to output a coded audio signal including information about parameter data of at least two related audio objects for one or more frequency intervals. Depending on the implementation, the output interface may receive other data (e.g., object downmix, or one or more transmission channels representing the object downmix, or additional parameters, or object waveform data in a mixed representation of several objects downmixed, or other objects in a separate representation) and input it into the coded audio signal. In this case, the object is directly introduced or "copied" into the corresponding transmission channel.

图1b示出了根据第二方面的用于对多个音频对象进行编码的装置的优选实现,其中接收到音频对象和相关对象元数据,该相关对象元数据指示关于多个音频对象的方向信息,即,每个对象或一组对象(如果该组对象具有与其相关联的相同方向信息)的一个方向信息。将音频对象输入到下混器400中,以便对多个音频对象进行下混以获得一个或多个传输通道。此外,提供了传输通道编码器300,该传输通道编码器300对一个或多个传输通道进行编码以获得一个或多个编码传输通道,该一个或多个编码传输通道然后被输入到输出接口200中。具体地,下混器400连接到对象方向信息提供器110,该对象方向信息提供器110在输入处接收可以导出对象元数据的任何数据,并输出由下混器400实际使用的方向信息。从对象方向信息提供器110转发给下混400的方向信息优选地是去量化方向信息,即,然后在解码器侧可用的相同方向信息。为此,对象方向信息提供器110被配置为导出或提取或获取非量化对象元数据,然后对对象元数据进行量化以导出表示图1b所示的“其他数据”中的量化索引的量化对象元数据,该量化索引在优选实施例中被提供给输出接口200。此外,对象方向信息提供器110被配置为对量化对象方向信息进行去量化以获得从块110转发给下混器400的实际方向信息。FIG1 b shows a preferred implementation of an apparatus for encoding a plurality of audio objects according to the second aspect, wherein audio objects and associated object metadata are received, the associated object metadata indicating directional information about a plurality of audio objects, i.e. one directional information for each object or a group of objects (if the group of objects has the same directional information associated therewith). The audio objects are input into a downmixer 400 so as to downmix the plurality of audio objects to obtain one or more transmission channels. In addition, a transmission channel encoder 300 is provided, which encodes the one or more transmission channels to obtain one or more encoded transmission channels, which are then input into the output interface 200. Specifically, the downmixer 400 is connected to an object directional information provider 110, which receives at input any data from which the object metadata can be derived, and outputs directional information actually used by the downmixer 400. The directional information forwarded from the object directional information provider 110 to the downmixer 400 is preferably dequantized directional information, i.e. the same directional information then available at the decoder side. 
To this end, the object direction information provider 110 is configured to derive or extract or obtain non-quantized object metadata, and then quantize the object metadata to derive quantized object metadata representing the quantization index in the "other data" shown in Figure 1b, which is provided to the output interface 200 in a preferred embodiment. In addition, the object direction information provider 110 is configured to dequantize the quantized object direction information to obtain the actual direction information forwarded from the block 110 to the downmixer 400.

优选地,输出接口200被配置为附加地接收音频对象的参数数据、对象波形数据、每个时间/频率区间的单个或多个相关对象的一个或若干个标识、以及如之前所讨论的量化方向数据。Preferably, the output interface 200 is configured to additionally receive parameter data of audio objects, object waveform data, one or several identifications of single or multiple related objects per time/frequency bin, and quantized direction data as discussed previously.

随后，示出了其他实施例。提出了一种用于对音频对象信号进行编码的参数化方法，该参数化方法允许低比特率的高效传输以及消费者侧的高质量再现。基于考虑每个关键频带和时间实例（时间/频率区）的一个方向提示的DirAC原理，针对输入信号的时间/频率表示的每个这样的时间/频率区确定最主要的对象。由于这证明了对于对象输入来说是不够的，因此每个时间/频率区确定附加的、第二最主要的对象，并且基于这两个对象，计算功率比以确定两个对象中的每个对象对所考虑的时间/频率区的影响。注意：针对每个时间/频率单位考虑多于两个的最主要对象也是可想象的，特别是对于越来越多的输入对象。为了简单起见，以下描述主要基于每个时间/频率单元的两个主要对象。Subsequently, further embodiments are presented. A parametric approach for encoding audio object signals is proposed which allows efficient transmission at low bit rates and high-quality reproduction at the consumer side. Based on the DirAC principle of considering one direction cue per critical band and time instance (time/frequency region), the most dominant object is determined for each such time/frequency region of the time/frequency representation of the input signals. Since this proves to be insufficient for object input, an additional, second most dominant object is determined per time/frequency region, and, based on these two objects, power ratios are calculated that determine the influence of each of the two objects on the considered time/frequency region. Note: considering more than two most dominant objects per time/frequency unit is also conceivable, especially for growing numbers of input objects. For simplicity, the following description is mainly based on two dominant objects per time/frequency unit.

因此,发送给解码器的参数化辅助信息包括:Therefore, the parameterized side information sent to the decoder includes:

·针对每个时间/频率区(或参数带)的相关(主要)对象子集计算的功率比。• Power ratios calculated for the relevant (dominant) object subset for each time/frequency region (or parameter band).

·表示每个时间/频率区(或参数带)的相关对象子集的对象索引。• An object index representing the relevant subset of objects for each time/frequency region (or parameter band).

·与对象索引相关联并针对每个帧提供的方向信息(其中每个时域帧包括多个参数带,并且每个参数带包括多个时间/频率区)。• Direction information associated with the object index and provided for each frame (where each time-domain frame comprises a plurality of parameter bands, and each parameter band comprises a plurality of time/frequency bins).

方向信息可经由与音频对象信号相关联的输入元数据文件获得。例如,可以基于帧来指定元数据。除了辅助信息之外,组合输入对象信号的下混信号也被发送给解码器。The directional information may be obtained via an input metadata file associated with the audio object signal. For example, the metadata may be specified on a frame basis. In addition to the auxiliary information, a downmix signal combining the input object signals is also sent to the decoder.

在渲染阶段期间，所发送的方向信息（经由对象索引导出）用于将所发送的下混信号（或更一般地：传输通道）平移到适当方向。基于所发送的功率比将下混信号分配到两个相关的对象方向，该功率比用作加权因子。该处理是针对解码下混信号的时间/频率表示的每个时间/频率区进行的。During the rendering stage, the transmitted direction information (derived via the object indices) is used to pan the transmitted downmix signal (or more generally: the transport channels) to the appropriate directions. The downmix signal is distributed to the two relevant object directions based on the transmitted power ratios, which serve as weighting factors. This processing is performed for each time/frequency region of the time/frequency representation of the decoded downmix signal.
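To illustrate this rendering step, a simplified sketch is given below; the function names and the constant-power stereo panner are assumptions made for the illustration only and do not represent the actual gain computation of the described system:

```python
import math

def pan_gains_stereo(azimuth_deg):
    # Hypothetical constant-power stereo panner: maps an azimuth in
    # [-90, 90] degrees to (left, right) gains. A real renderer would
    # typically use a panner suited to the actual loudspeaker layout.
    az = max(-90.0, min(90.0, azimuth_deg))
    x = (az + 90.0) / 180.0          # 0 (fully right) .. 1 (fully left)
    return math.sin(x * math.pi / 2), math.cos(x * math.pi / 2)

def render_tile(downmix, power_ratios, azimuths_deg):
    # Distribute one time/frequency tile of the downmix to the dominant
    # object directions, using the transmitted power ratios as weights.
    out_l = out_r = 0.0
    for ratio, az in zip(power_ratios, azimuths_deg):
        g_l, g_r = pan_gains_stereo(az)
        out_l += ratio * g_l * downmix
        out_r += ratio * g_r * downmix
    return out_l, out_r
```

A single dominant object at +90° azimuth thus ends up entirely in the left channel, while two equally weighted objects at 0° are panned to the center.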

本节对编码器侧处理进行了总结,然后详细描述了参数和下混计算。音频编码器接收一个或多个音频对象信号。对于每个音频对象信号,描述对象属性的元数据文件是相关联的。在该实施例中,关联元数据文件中所描述的对象属性对应于基于帧提供的方向信息,其中,一个帧对应于20毫秒。每个帧由帧号来标识,也被包含在元数据文件中。方向信息以方位角和仰角信息的形式给出,其中,方位角从(-180,180]度中取值,而仰角从[-90,90]度中取值。元数据中提供的其他属性可以包括例如距离、传播、增益;在该实施例中未考虑这些属性。This section summarizes the encoder-side processing and then describes the parameters and downmix calculations in detail. The audio encoder receives one or more audio object signals. For each audio object signal, a metadata file describing the object properties is associated. In this embodiment, the object properties described in the associated metadata file correspond to directional information provided on a frame basis, where a frame corresponds to 20 milliseconds. Each frame is identified by a frame number and is also included in the metadata file. The directional information is given in the form of azimuth and elevation information, where the azimuth takes values from (-180, 180] degrees and the elevation takes values from [-90, 90] degrees. Other properties provided in the metadata may include, for example, distance, propagation, gain; these properties are not considered in this embodiment.

元数据文件中提供的信息与实际音频对象文件一起用于创建参数集,该参数集被发送给解码器并用于对最终音频输出文件进行渲染。更具体地,编码器估计每个给定时间/频率区的主要对象子集的参数,即功率比。主要对象子集由对象索引来表示,该对象索引也用于标识对象方向。这些参数与传输通道和方向元数据一起被发送给解码器。The information provided in the metadata file is used together with the actual audio object files to create a parameter set that is sent to the decoder and used to render the final audio output file. More specifically, the encoder estimates the parameters, i.e., power ratios, of the main object subsets for each given time/frequency region. The main object subsets are represented by object indices, which are also used to identify the object directions. These parameters are sent to the decoder along with the transmission channel and direction metadata.

图2给出了编码器的概览，其中传输通道包括根据输入对象文件计算的下混信号和在输入元数据中提供的方向信息。传输通道的数量总是小于输入对象文件的数量。在实施例的编码器中，编码音频信号由编码传输通道来表示，并且编码参数化辅助信息由编码对象索引、编码功率比和编码方向信息指示。编码传输通道和编码参数化辅助信息两者一起形成由多路复用器220输出的比特流。具体地，编码器包括接收输入对象音频文件的滤波器组102。此外，对象元数据文件被提供给提取器方向信息块110a。块110a的输出被输入到量化方向信息块110b中，该量化方向信息块110b将方向信息输出到执行下混计算的下混器400。此外，经量化的方向信息（即，量化索引）从块110b转发给编码方向信息202块，该编码方向信息202块优选地执行某种熵编码以便进一步降低所需的比特率。Fig. 2 gives an overview of the encoder, in which the transmission channels comprise a downmix signal calculated from the input object files, and the direction information is provided in the input metadata. The number of transmission channels is always smaller than the number of input object files. In the encoder of the embodiment, the encoded audio signal is represented by the encoded transmission channels, and the encoded parameterized side information is indicated by the encoded object indices, the encoded power ratios and the encoded direction information. The encoded transmission channels and the encoded parameterized side information together form the bitstream output by the multiplexer 220. Specifically, the encoder comprises a filter bank 102 receiving the input object audio files. Furthermore, the object metadata files are provided to an extract direction information block 110a. The output of block 110a is input into a quantize direction information block 110b, which outputs the direction information to the downmixer 400 performing the downmix calculation. In addition, the quantized direction information (i.e., the quantization indices) is forwarded from block 110b to an encode direction information block 202, which preferably performs some kind of entropy encoding in order to further reduce the required bit rate.

此外，滤波器组102的输出被输入到信号功率计算块104，并且信号功率计算块104的输出被输入到对象选择块106并且附加地被输入到功率比计算块108。功率比计算块108还连接到对象选择块106，以便计算功率比（即，仅用于所选对象的组合值）。在块210中，对所计算的功率比或组合值进行量化和编码。如稍后将概述的，功率比是优选的，以便节省一个功率数据项的传输。然而，在不需要这种节省的其他实施例中，不是功率比，而是实际信号功率或从由块104确定的信号功率中导出的其他值可以在对象选择器106的选择下被输入到量化器和编码器中。然后，不需要功率比计算108，并且对象选择106确保仅将相关参数化数据（即，相关对象的功率相关数据）输入到块210中以用于量化和编码的目的。Furthermore, the output of the filter bank 102 is input into a signal power calculation block 104, and the output of the signal power calculation block 104 is input into an object selection block 106 and, additionally, into a power ratio calculation block 108. The power ratio calculation block 108 is also connected to the object selection block 106, so that the power ratios (i.e., combined values only for the selected objects) are calculated. In block 210, the calculated power ratios or combined values are quantized and encoded. As will be outlined later, power ratios are preferred since they save the transmission of one power data item. However, in other embodiments where this saving is not needed, not the power ratios but the actual signal powers, or other values derived from the signal powers determined by block 104, can be input into the quantizer and encoder under the selection of the object selector 106. Then, the power ratio calculation 108 is not needed, and the object selection 106 ensures that only the relevant parametric data (i.e., the power-related data of the relevant objects) is input into block 210 for the purpose of quantization and encoding.

将图1a与图2进行比较,块102、104、110a、110b、106、108优选地被包括在图1a的对象参数计算器100中,并且块202、210、220优选地被包括在图1a的输出接口块200中。Comparing Fig. 1a with Fig. 2, blocks 102, 104, 110a, 110b, 106, 108 are preferably included in the object parameter calculator 100 of Fig. 1a, and blocks 202, 210, 220 are preferably included in the output interface block 200 of Fig. 1a.

此外,图2中的核心编码器300对应于图1b的传输通道编码器300,下混计算块400对应于图1b的下混器400,以及图1b的对象方向信息提供器110对应于图2的块110a、110b。此外,图1b的输出接口200优选地以与图1a的输出接口200相同的方式实现并且包括图2的块202、210、220。Furthermore, the core encoder 300 in Fig. 2 corresponds to the transmission channel encoder 300 of Fig. 1b, the downmix calculation block 400 corresponds to the downmixer 400 of Fig. 1b, and the object direction information provider 110 of Fig. 1b corresponds to the blocks 110a, 110b of Fig. 2. Furthermore, the output interface 200 of Fig. 1b is preferably implemented in the same manner as the output interface 200 of Fig. 1a and comprises the blocks 202, 210, 220 of Fig. 2.

图3示出了编码器变体,其中,下混计算是可选的并且不依赖于输入元数据。在该变体中,输入音频文件可以直接馈送到核心编码器中,该核心编码器从它们创建传输通道,因此传输通道的数量对应于输入对象文件的数量;如果输入对象的数量是1或2,这将特别有趣。对于大量对象,下混信号仍将用于减少要发送的数据量。Figure 3 shows an encoder variant where the downmix calculation is optional and does not depend on the input metadata. In this variant, the input audio files can be fed directly into the core encoder, which creates transport channels from them, so the number of transport channels corresponds to the number of input object files; this is particularly interesting if the number of input objects is 1 or 2. For a large number of objects, the downmix signal will still be used to reduce the amount of data to be sent.

在图3中,相似的附图标记指代图2的相似功能。这不仅针对图2和图3有效,而且对本说明书中描述的所有其他图也有效。与图2不同,图3在没有任何方向信息的情况下执行下混计算400。因此,下混计算可以是例如使用预先已知的下混矩阵的静态下混,或者可以是不依赖于与包括在输入对象音频文件中的对象相关联的任何方向信息的能量相关下混。然而,方向信息在块110a中被提取并且在块110b中被量化,并且经量化的值被转发给方向信息编码器202以便在形成比特流的编码音频信号(例如,二进制编码音频信号)中具有编码方向信息。In FIG. 3 , similar reference numerals refer to similar functions of FIG. 2 . This is valid not only for FIG. 2 and FIG. 3 , but also for all other figures described in this specification. Unlike FIG. 2 , FIG. 3 performs a downmix calculation 400 without any directional information. Thus, the downmix calculation may be, for example, a static downmix using a pre-known downmix matrix, or may be an energy-dependent downmix that is independent of any directional information associated with an object included in the input object audio file. However, the directional information is extracted in block 110 a and quantized in block 110 b, and the quantized value is forwarded to a directional information encoder 202 so as to have the encoded directional information in the encoded audio signal (e.g., a binary encoded audio signal) forming the bitstream.

在具有不太多数量的输入音频对象文件的情况下或者在具有足够的可用传输带宽的情况下，下混计算块400也可以被省去，使得输入音频对象文件直接表示由核心编码器编码的传输通道。在这种实现中，也不需要块104、106、108、210。然而，优选的实现导致混合实现，其中，一些对象被直接引入到传输通道中，而其他对象被下混到一个或多个传输通道中。在这种情况下，则图3所示的所有块将是必需的，以便直接生成如下比特流，该比特流在编码传输通道内具有一个或多个对象以及由图2或图3的下混器400生成的一个或多个传输通道。In the case of a not too large number of input audio object files, or in the case of sufficient available transmission bandwidth, the downmix calculation block 400 can also be omitted, so that the input audio object files directly represent the transmission channels encoded by the core encoder. In such an implementation, blocks 104, 106, 108, 210 are not needed either. However, a preferred implementation results in a hybrid implementation in which some objects are directly introduced into transmission channels, while other objects are downmixed into one or more transmission channels. In this case, all blocks shown in Fig. 3 are necessary in order to directly generate a bitstream having, within the encoded transmission channels, one or more objects as well as one or more transmission channels generated by the downmixer 400 of Fig. 2 or Fig. 3.

参数计算Parameter calculation

使用滤波器组将包括所有输入对象信号的时域音频信号转换到时域/频域中。例如：CLDFB（复杂低延迟滤波器组）分析滤波器将20毫秒的帧（对应于48kHz的采样率下的960个样本）转换为大小为16x60的时间/频率区，其中具有16个时隙和60个频带。对于每个时间/频率单元，瞬时信号功率被计算为：The time-domain audio signals comprising all input object signals are converted into the time/frequency domain using a filter bank. For example, a CLDFB (Complex Low Delay Filter Bank) analysis filter converts a frame of 20 milliseconds (corresponding to 960 samples at a sampling rate of 48 kHz) into a time/frequency representation of size 16x60, with 16 time slots and 60 frequency bands. For each time/frequency unit, the instantaneous signal power is calculated as:

P_i(k, n) = |x_i(k, n)|^2,

其中，k表示频带索引，n表示时隙索引，以及i表示对象索引。由于每个时间/频率区的发送参数在最终比特率方面的成本非常高，因此采用分组以便计算针对减少数量的时间/频率区的参数。例如：16个时隙可以一起被组合为单个时隙，并且60个频带可以基于心理声学尺度被组合为11个频带。这将16x60的初始尺寸减小到与11个所谓的参数带相对应的1x11。基于分组对瞬时信号功率值进行求和，以获得减小维度的信号功率：where k denotes the frequency band index, n denotes the time slot index, and i denotes the object index. Since transmitting parameters for each time/frequency region is very costly in terms of the final bit rate, a grouping is employed in order to calculate the parameters for a reduced number of time/frequency regions. For example, the 16 time slots can be grouped together into a single time slot, and the 60 frequency bands can be grouped into 11 bands based on a psychoacoustic scale. This reduces the initial size of 16x60 to 1x11, corresponding to 11 so-called parameter bands. The instantaneous signal power values are summed according to the grouping to obtain the signal powers of reduced dimension:

P_i(m, l) = Σ_{n=0}^{T} Σ_{k=B_S(l)}^{B_E(l)} P_i(k, n),

其中，T对应于该示例中的15，并且B_S和B_E定义参数带边界。where T corresponds to 15 in this example, and B_S and B_E define the parameter band boundaries.
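The grouping described above can be sketched as follows (plain Python; the helper name and the data layout are assumptions made for the illustration):

```python
def grouped_powers(x_tf, band_edges):
    # x_tf: complex time/frequency tiles of one object, x_tf[n][k] for
    # time slot n and frequency band k.
    # band_edges[l] = (b_s, b_e): inclusive band range of parameter band l.
    # Returns the instantaneous powers |x|^2 summed over all time slots
    # and over the grouped bands, i.e. the reduced-dimension signal power.
    result = []
    for b_s, b_e in band_edges:
        p = 0.0
        for slot in x_tf:
            for k in range(b_s, b_e + 1):
                p += abs(slot[k]) ** 2
        result.append(p)
    return result
```

For the 16x60 CLDFB grid of the text, `x_tf` would have 16 slot rows of 60 bands each and `band_edges` would describe the 11 parameter bands.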

为了确定针对其计算参数的最主要对象的子集，所有N个输入音频对象的瞬时信号功率值按降序进行排序。在该实施例中，我们确定两个最主要对象，并且范围从0至N-1的对应对象索引被存储为要发送的参数的一部分。此外，计算了将两个主要对象信号彼此相关的功率比：To determine the subset of the most dominant objects for which parameters are calculated, the instantaneous signal power values of all N input audio objects are sorted in descending order. In this embodiment, we determine the two most dominant objects, and the corresponding object indices ranging from 0 to N-1 are stored as part of the parameters to be sent. In addition, the power ratios that relate the two dominant object signals to each other are calculated:

PR_i(m, l) = P_i(m, l) / (P_1(m, l) + P_2(m, l)),  i = 1, 2

或者用不限于两个对象的更一般的表达方式：Or in a more general way that is not limited to two objects:

PR_i(m, l) = P_i(m, l) / Σ_{j=1}^{S} P_j(m, l),

其中，在该上下文中，S表示要考虑的主要对象的数量，并且：where, in this context, S represents the number of primary objects to be considered, and:

Σ_{i=1}^{S} PR_i(m, l) = 1

在两个主要对象的情况下,对于两个对象中的每个对象,0.5的功率比意味着两个对象同样存在于对应的参数带内,而1和0的功率比表示两个对象之一不存在。这些功率比被存储为要发送的参数的第二部分。由于功率比总和为1,因此发送S-1个而不是S个值就足够了。In the case of two main objects, for each of the two objects, a power ratio of 0.5 means that both objects are equally present in the corresponding parameter band, while power ratios of 1 and 0 indicate that one of the two objects is absent. These power ratios are stored as the second part of the parameters to be sent. Since the power ratios sum to 1, it is sufficient to send S-1 values instead of S.
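The selection of the most dominant objects and the computation of their power ratios, as described above, can be sketched for a single parameter band as follows (names are assumptions for the illustration):

```python
def dominant_objects(powers, s=2):
    # powers: per-object signal power in one parameter band.
    # Returns the indices of the s most dominant objects (by descending
    # power) and their power ratios. The ratios sum to 1 by construction,
    # so only s - 1 of them need to be transmitted.
    order = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    selected = order[:s]
    total = sum(powers[i] for i in selected)
    if total == 0.0:
        return selected, [1.0 / s] * s
    return selected, [powers[i] / total for i in selected]
```

With `powers = [1.0, 3.0, 6.0]`, objects 2 and 1 are selected with ratios 2/3 and 1/3; only the first ratio would be transmitted.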

除了每个参数带的对象索引和功率比值之外,还必须发送从输入元数据文件中提取的每个对象的方向信息。由于信息最初是基于帧来提供的,因此这是针对每个帧进行的(其中,每个帧在所描述的示例中包括11个参数带或总共16x60个时间/频率区)。因此,对象索引间接地表示对象方向。注意:由于功率比总和为1,因此每个参数带要发送的功率比的数量可以减少1;例如:在考虑2个相关对象的情况下,发送1个功率比值就足够了。In addition to the object index and the power ratio value for each parameter band, the direction information for each object extracted from the input metadata file must be sent. Since the information is originally provided on a frame basis, this is done for each frame (wherein each frame comprises 11 parameter bands or a total of 16x60 time/frequency bins in the described example). The object index therefore indirectly indicates the object direction. Note: Since the power ratios sum to 1, the number of power ratios to be sent per parameter band can be reduced by 1; for example: in the case where 2 related objects are considered, sending 1 power ratio value is sufficient.

对方向信息和功率比值两者进行量化并与对象索引进行组合以形成参数化辅助信息。然后，该参数化辅助信息被编码，并与编码传输通道/下混信号一起被混合到最终比特流表示中。例如，通过使用每个值3个比特来量化功率比，实现输出质量和所耗费比特率之间的良好折衷。方向信息可以被提供有5度的角分辨率，并随后以每个方位角值7个比特和每个仰角值6个比特进行量化，以给出实际示例。Both the direction information and the power ratio values are quantized and combined with the object indices to form the parameterized side information. This parameterized side information is then encoded and multiplexed with the encoded transmission channels/downmix signal into the final bitstream representation. For example, quantizing the power ratios using 3 bits per value achieves a good compromise between output quality and the bit rate expended. To give a practical example, the direction information may be provided with an angular resolution of 5 degrees and subsequently quantized with 7 bits per azimuth value and 6 bits per elevation value.
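A hypothetical uniform scalar quantizer matching the bit figures mentioned in the text (3 bits per power ratio value, 7 bits per azimuth value, 6 bits per elevation value) could look as follows; the actual quantization tables of the described system are not specified here and may well differ:

```python
def quantize(value, lo, hi, bits):
    # Uniform scalar quantization of value in [lo, hi] to a bits-wide index.
    levels = (1 << bits) - 1
    value = max(lo, min(hi, value))
    return round((value - lo) / (hi - lo) * levels)

def dequantize(index, lo, hi, bits):
    # Inverse mapping from the index back to the value range.
    levels = (1 << bits) - 1
    return lo + index / levels * (hi - lo)

# 3-bit power ratio in [0, 1], 7-bit azimuth in (-180, 180],
# 6-bit elevation in [-90, 90]
pr_idx = quantize(0.75, 0.0, 1.0, 3)
az_idx = quantize(30.0, -180.0, 180.0, 7)   # step of roughly 2.8 degrees
el_idx = quantize(-45.0, -90.0, 90.0, 6)    # step of roughly 2.9 degrees
```

With 7 bits over 360 degrees the reconstruction error stays well below the 5-degree input resolution mentioned above.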

下混计算Downmix calculation

所有输入音频对象信号被组合为下混信号,该下混信号包括一个或多个传输通道,其中,传输通道的数量小于输入对象信号的数量。注意:在该实施例中,如果仅存在一个输入对象,则仅出现单个传输通道,这因此意味着下混计算被跳过。All input audio object signals are combined into a downmix signal comprising one or more transmission channels, wherein the number of transmission channels is smaller than the number of input object signals. Note: In this embodiment, if only one input object is present, only a single transmission channel is present, which therefore means that the downmix calculation is skipped.

如果下混包括两个传输通道,则该立体声下混例如可以被计算为虚拟心形麦克风信号。虚拟心形麦克风信号是通过在元数据文件中应用针对每个帧提供的方向信息来确定的(这里,假设所有仰角值为零):If the downmix comprises two transmission channels, the stereo downmix may for example be calculated as a virtual cardioid microphone signal. The virtual cardioid microphone signal is determined by applying the direction information provided for each frame in the metadata file (here, all elevation angle values are assumed to be zero):

w_L = 0.5 + 0.5 * cos(azimuth - pi/2)

w_R = 0.5 + 0.5 * cos(azimuth + pi/2)

在这里，虚拟心形位于90°和-90°处。因此确定两个传输通道（左和右）中每个传输通道的单独权重，并将这些单独权重应用于对应的音频对象信号：Here, the virtual cardioids are located at +90° and -90°. Individual weights are thus determined for each of the two transmission channels (left and right) and applied to the corresponding audio object signals:

DMX_L = Σ_{i=1}^{N} w_{L,i} · x_i,   DMX_R = Σ_{i=1}^{N} w_{R,i} · x_i

在该上下文中，N是大于或等于2的输入对象数量。如果针对每个帧更新了虚拟心形权重，则采用适合于方向信息的动态下混。另一种可能性是采用固定下混，其中假设每个对象位于静态位置处。该静态位置例如可以对应于对象的初始方向，这然后导致对于所有帧都相同的静态虚拟心形权重。In this context, N is the number of input objects, which is greater than or equal to 2. If the virtual cardioid weights are updated for each frame, a dynamic downmix adapted to the direction information is employed. Another possibility is to employ a fixed downmix, where each object is assumed to be located at a static position. This static position may correspond, for example, to the initial direction of the object, which then results in static virtual cardioid weights that are the same for all frames.
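Under the zero-elevation assumption stated above, the virtual cardioid downmix can be sketched as follows (function names are assumptions made for the illustration):

```python
import math

def cardioid_weights(azimuth_deg):
    # Virtual cardioid weights for the left/right transmission channels,
    # with the cardioids pointing to +90 and -90 degrees azimuth.
    az = math.radians(azimuth_deg)
    w_l = 0.5 + 0.5 * math.cos(az - math.pi / 2)
    w_r = 0.5 + 0.5 * math.cos(az + math.pi / 2)
    return w_l, w_r

def downmix_stereo(object_samples, azimuths_deg):
    # Mix N >= 2 object signals (one sample per object here) into a
    # stereo downmix using the per-object cardioid weights.
    left = right = 0.0
    for sample, az in zip(object_samples, azimuths_deg):
        w_l, w_r = cardioid_weights(az)
        left += w_l * sample
        right += w_r * sample
    return left, right
```

An object at +90° azimuth receives a left weight of 1 and a right weight of 0, while an object at 0° contributes equally to both transmission channels.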

如果目标比特率允许,则可以设想多于两个传输通道。在三个传输通道的情况下,则心形可以均匀地布置在例如0°、120°和-120°处。如果使用四个传输通道,则第四心形可以面朝上或者四个心形可以再次以均匀的方式水平地布置。如果对象位置例如仅是一个半球的一部分,则该布置也可以针对对象位置定制。所得下混信号由核心编码器进行处理,并与编码参数化辅助信息一起被转化为比特流表示。If the target bitrate allows, more than two transmission channels can be envisaged. In case of three transmission channels, the cardioids can be arranged uniformly at, for example, 0°, 120° and -120°. If four transmission channels are used, a fourth cardioid can face upwards or the four cardioids can be arranged horizontally again in a uniform manner. The arrangement can also be customized to the object position if it is, for example, only a part of one hemisphere. The resulting downmix signal is processed by the core encoder and converted into a bitstream representation together with the encoded parameterized auxiliary information.

备选地,可以将输入对象信号馈送到核心编码器中而不被组合为下混信号。在这种情况下,所得传输通道的数量对应于输入对象信号的数量。通常,给出与总比特率相关的传输通道的最大数量。然后,仅当输入对象信号的数量超过传输通道的该最大数量时才采用下混信号。Alternatively, the input object signals may be fed into the core encoder without being combined into a downmix signal. In this case, the resulting number of transmission channels corresponds to the number of input object signals. Typically, a maximum number of transmission channels is given that is relevant to the total bit rate. The downmix signal is then only employed if the number of input object signals exceeds this maximum number of transmission channels.
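The decision described in this paragraph amounts to a simple rule, sketched below (the maximum number of transmission channels is assumed to be given by the bit-rate configuration; the function name is an assumption):

```python
def transport_config(num_objects, max_transport_channels):
    # Pass the objects straight through as transmission channels when
    # they fit the budget; otherwise downmix to the maximum allowed number.
    if num_objects <= max_transport_channels:
        return num_objects, False   # no downmix needed
    return max_transport_channels, True
```

For instance, two objects with a budget of four channels are passed through unchanged, whereas five objects with a budget of two channels are downmixed.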

图6a示出了用于对编码音频信号(例如,由图1a或图2或图3输出的信号)进行解码的解码器,该编码音频信号包括多个音频对象的方向信息和一个或多个传输通道。此外,对于时间帧的一个或多个频率区间,编码音频信号包括至少两个相关音频对象的参数数据,其中,至少两个相关对象的数量低于多个音频对象的总数。具体地,解码器包括用于提供频谱表示形式的一个或多个传输通道的输入接口,该频谱表示在时间帧中具有多个频率区间。这表示从输入接口块600转发给音频渲染器块700的信号。具体地,音频渲染器700被配置用于使用包括在编码音频信号中的方向信息将一个或多个传输通道渲染为多个音频通道,音频通道的数量对于立体声输出格式优选地是两个通道,或者对于较多数量的输出格式(例如,3通道、5通道、5.1通道等)优选地是多于两个通道。具体地,音频渲染器700被配置为:针对一个或多个频率区间中的每一频率区间,根据与至少两个相关音频对象中的第一相关音频对象相关联的第一方向信息以及根据与至少两个相关音频对象中的第二相关音频对象相关联的第二方向信息来计算一个或多个传输通道的贡献。具体地,多个音频对象的方向信息包括与第一对象相关联的第一方向信息、以及与第二对象相关联的第二方向信息。Fig. 6a shows a decoder for decoding a coded audio signal (e.g., a signal output by Fig. 1a or Fig. 2 or Fig. 3), the coded audio signal comprising directional information of a plurality of audio objects and one or more transmission channels. In addition, for one or more frequency intervals of a time frame, the coded audio signal comprises parameter data of at least two related audio objects, wherein the number of at least two related objects is lower than the total number of the plurality of audio objects. Specifically, the decoder comprises an input interface for providing one or more transmission channels in the form of a spectrum representation, the spectrum representation having a plurality of frequency intervals in the time frame. This represents a signal forwarded to an audio renderer block 700 from an input interface block 600. Specifically, the audio renderer 700 is configured to render one or more transmission channels as a plurality of audio channels using the directional information included in the coded audio signal, the number of audio channels being preferably two channels for a stereo output format, or preferably more than two channels for a larger number of output formats (e.g., 3 channels, 5 channels, 5.1 channels, etc.). 
Specifically, the audio renderer 700 is configured to: for each frequency interval in the one or more frequency intervals, calculate the contribution of the one or more transmission channels according to the first direction information associated with the first related audio object in the at least two related audio objects and according to the second direction information associated with the second related audio object in the at least two related audio objects. Specifically, the direction information of the multiple audio objects includes the first direction information associated with the first object and the second direction information associated with the second object.

图8b示出了帧的参数数据,在一个优选实施例中,该参数数据包括多个音频对象的方向信息810,此外还包括在812处示出的一定数量的参数带中的每个参数带的功率比,以及在块814处指示的每个参数带的一个(优选地是两个或甚至更多个)对象索引。具体地,图8c中更详细地示出了多个音频对象810的方向信息。图8c示出了具有从1至N的某个对象ID的第一列的表,其中N是多个音频对象的数量。此外,提供了第二列,其具有每个对象的方向信息,该方向信息优选地作为方位角值和仰角值,或者在二维情况的情况下,仅作为方位角值。这在818处被示出。因此,图8c示出了输入到图6a的输入接口600中的被包括在编码音频信号中的“方向码本”。来自列818的方向信息与来自列816的某个对象ID唯一地相关联,并且对帧中的“整个”对象(即,对帧中的所有频带)有效。因此,无论频率区间的数量是高分辨率表示中的时间/频率区还是较低分辨率表示中的时间/参数带,仅单个方向信息被发送并被输入接口用于每个对象标识。FIG8b shows parameter data of a frame, which in a preferred embodiment includes directional information 810 of a plurality of audio objects, and in addition includes a power ratio of each parameter band in a certain number of parameter bands shown at 812, and one (preferably two or even more) object index for each parameter band indicated at block 814. Specifically, the directional information of a plurality of audio objects 810 is shown in more detail in FIG8c. FIG8c shows a table with a first column of a certain object ID from 1 to N, where N is the number of a plurality of audio objects. In addition, a second column is provided, which has directional information for each object, preferably as an azimuth value and an elevation value, or in the case of a two-dimensional case, as an azimuth value only. This is shown at 818. Therefore, FIG8c shows a "directional codebook" included in the encoded audio signal input to the input interface 600 of FIG6a. The directional information from column 818 is uniquely associated with a certain object ID from column 816 and is valid for the "entire" object in the frame (i.e., for all frequency bands in the frame). Thus, regardless of the number of frequency bins, whether time/frequency bins in a high resolution representation or time/parameter bands in a lower resolution representation, only a single direction information is sent and input to the interface for each object identification.

在该上下文中，图8a示出了当图2或图3的滤波器组102被实现为之前讨论的CLDFB（复杂低延迟滤波器组）时由该滤波器组生成的时间/频率表示。对于如之前关于图8b和图8c讨论给出方向信息的帧，滤波器组生成图8a中的范围从0至15的16个时隙和范围从0至59的60个频带。因此，一个时隙和一个频带代表时间/频率区802或804。然而，为了降低辅助信息的比特率，优选地将高分辨率表示转换为低分辨率表示，如图8b所示，其中仅存在单个时间区间，并且其中60个频带被转换为如图8b中的812处所示的11个参数带。因此，如图10c所示，高分辨率表示由时隙索引n和频带索引k来指示，并且低分辨率表示由分组时隙索引m和参数带索引l给出。然而，在本说明书的上下文中，时间/频率区间可以包括图8a的高分辨率时间/频率区802、804或由图10c中的块731c的输入处的分组时隙索引和参数带索引标识的低分辨率时间/频率单元。In this context, Fig. 8a shows the time/frequency representation generated by the filter bank 102 of Fig. 2 or Fig. 3 when this filter bank is implemented as the previously discussed CLDFB (Complex Low Delay Filter Bank). For a frame for which the direction information is given as discussed before with respect to Fig. 8b and Fig. 8c, the filter bank generates the 16 time slots ranging from 0 to 15 and the 60 frequency bands ranging from 0 to 59 in Fig. 8a. Thus, one time slot and one frequency band represent a time/frequency region 802 or 804. However, in order to reduce the bit rate of the side information, the high-resolution representation is preferably converted into a low-resolution representation, as shown in Fig. 8b, in which only a single time interval exists and in which the 60 frequency bands are converted into 11 parameter bands as shown at 812 in Fig. 8b. Therefore, as shown in Fig. 10c, the high-resolution representation is indicated by the time slot index n and the frequency band index k, and the low-resolution representation is given by the grouped slot index m and the parameter band index l. In the context of this specification, however, a time/frequency bin may comprise a high-resolution time/frequency region 802, 804 of Fig. 8a or a low-resolution time/frequency unit identified by the grouped slot index and the parameter band index at the input of block 731c in Fig. 10c.

在图6a的实施例中,音频渲染器700被配置为:针对一个或多个频率区间中的每一频率区间,根据与至少两个相关音频对象中的第一相关音频对象相关联的第一方向信息以及根据与至少两个相关音频对象中的第二相关音频对象相关联的第二方向信息来计算一个或多个传输通道的贡献。在图8b所示的实施例中,块814具有参数带中每个相关对象的对象索引,即具有两个或更多个对象索引,使得每个时间频率区间存在两个贡献。In the embodiment of Fig. 6a, the audio renderer 700 is configured to calculate, for each frequency interval in the one or more frequency intervals, the contribution of the one or more transmission channels according to the first directional information associated with the first related audio object in the at least two related audio objects and according to the second directional information associated with the second related audio object in the at least two related audio objects. In the embodiment shown in Fig. 8b, the block 814 has an object index for each related object in the parameter band, i.e., has two or more object indices, so that there are two contributions for each time-frequency interval.

As will be outlined later with respect to Fig. 10a, the calculation of the contributions can be performed indirectly via a mixing matrix, where a gain value for each relevant object is determined and used to calculate the mixing matrix. Alternatively, as illustrated in Fig. 10b, the contributions can be calculated explicitly, again using the gain values, and the explicitly calculated contributions are then summed for each output channel in a certain time/frequency bin. Hence, irrespective of whether the contributions are calculated explicitly or implicitly, the audio renderer still renders the one or more transport channels into the plurality of audio channels using the direction information, so that, for each of the one or more frequency bins, a contribution of the one or more transport channels is included in the plurality of audio channels in accordance with the first direction information associated with the first one of the at least two relevant audio objects and in accordance with the second direction information associated with the second one of the at least two relevant audio objects.

Fig. 6b illustrates a decoder in accordance with the second aspect for decoding an encoded audio signal comprising direction information for a plurality of audio objects and one or more transport channels, as well as parametric data for the audio objects for one or more frequency bins of a time frame. Again, the decoder comprises an input interface 600 receiving the encoded audio signal, and the decoder comprises an audio renderer 700 for rendering the one or more transport channels into a plurality of audio channels using the direction information. Specifically, the audio renderer is configured to calculate direct response information from the one or more audio objects per frequency bin of the plurality of frequency bins and from the direction information associated with the one or more relevant audio objects in the frequency bin. This direct response information preferably comprises gain values to be used for covariance synthesis, for advanced covariance synthesis, or for an explicit calculation of the contributions of the one or more transport channels.

Preferably, the audio renderer is configured to calculate covariance synthesis information using the direct response information of the one or more relevant audio objects in the time/frequency band and using information on the plurality of audio channels. Furthermore, the covariance synthesis information, which is preferably a mixing matrix, is applied to the one or more transport channels to obtain the plurality of audio channels. In a further implementation, the direct response information is a direct response vector for each of the one or more audio objects, the covariance synthesis information is a covariance synthesis matrix, and the audio renderer is configured to perform a matrix operation per frequency bin when applying the covariance synthesis information.

Furthermore, the audio renderer 700 is configured to derive, when calculating the direct response information, a direct response vector for the one or more audio objects and to calculate, for the one or more audio objects, a covariance matrix from each direct response vector. Moreover, when calculating the covariance synthesis information, a target covariance matrix is calculated. Instead of the target covariance matrix, however, related information can be used, i.e., the direct response matrix or vectors of the one or more most dominant objects, together with the diagonal matrix of direct powers, indicated as E, determined by applying the power ratios.

Hence, the target covariance information does not necessarily have to be an explicit target covariance matrix, but is derived from the covariance matrix of one audio object, or of several audio objects, in a time/frequency bin, from power information on the corresponding one or more audio objects in the time/frequency bin, and from power information derived from the one or more transport channels for the one or more time/frequency bins.

The bitstream representation is read by the decoder, and the encoded transport channels and the encoded parametric side information contained therein are made available for further processing. The parametric side information comprises:

· direction information as quantized azimuth and elevation values (per frame),

· object indices denoting the subset of relevant objects (per parameter band),

· quantized power ratios relating the relevant objects to one another (per parameter band).

All processing is carried out in a frame-wise manner, where each frame contains one or more subframes. A frame may, for example, consist of four subframes, in which case one subframe has a duration of 5 ms. Fig. 4 illustrates a simplified overview of the decoder.

Fig. 4 illustrates an audio decoder implementing the first and the second aspect. The input interface 600 illustrated in Figs. 6a and 6b comprises a demultiplexer 602, a core decoder 604, a decoder 608 for decoding the object indices, a decoder 612 for decoding and dequantizing the power ratios, and a decoder 610 for decoding and dequantizing the direction information. Furthermore, the input interface comprises a filter bank 606 for providing the transport channels in a time/frequency representation.

The audio renderer 700 comprises a direct response calculator 704, a prototype matrix provider 702 controlled by an output configuration received, for example, via a user interface, a covariance synthesis block 706, and a synthesis filter bank 708, in order to finally provide the output audio file comprising the number of audio channels of the channel output format.

Hence, items 602, 604, 606, 608, 610, 612 are preferably comprised in the input interface of Figs. 6a and 6b, and items 702, 704, 706, 708 of Fig. 4 are part of the audio renderer indicated with reference numeral 700 in Fig. 6a or Fig. 6b.

The encoded parametric side information is decoded, and the quantized power ratio values, the quantized azimuth and elevation values (direction information), and the object indices are regained. The one power ratio value that is not transmitted is obtained by exploiting the fact that all power ratio values sum to 1. Their resolution (l, m) corresponds to the grouping of time/frequency tiles employed at the encoder side. During the further processing steps, which use the finer time/frequency resolution (k, n), the parameters of a parameter band are valid for all time/frequency tiles contained in that parameter band, corresponding to an expansion (l, m) → (k, n).
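For illustration, these two decoder-side steps, recovering the non-transmitted power ratio from the sum-to-1 property and expanding parameter-band values to the finer grid, can be sketched as follows (the band edges passed in are an assumption for the example):

```python
import numpy as np


def recover_power_ratios(sent_ratios):
    """The encoder transmits one ratio less than the number of relevant
    objects; the missing one follows from the fact that all ratios sum to 1."""
    missing = 1.0 - sum(sent_ratios)
    return list(sent_ratios) + [missing]


def expand_to_tf_grid(param_values, band_edges, num_slots):
    """Expand per-parameter-band values (resolution (l, m)) to the finer
    (k, n) grid: each tile inherits the value of its parameter band."""
    num_bands = band_edges[-1]
    fine = np.empty((num_slots, num_bands))
    for l in range(len(band_edges) - 1):
        fine[:, band_edges[l]:band_edges[l + 1]] = param_values[l]
    return fine
```

A parameter band's single value is thus simply replicated over every high-resolution tile it covers.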

The encoded transport channels are decoded by the core decoder. Using a filter bank (matched to the one used in the encoder), each frame of the audio signal thus decoded is converted into a time/frequency representation whose resolution is typically finer than, but at least equal to, the one used for the parametric side information.

Output signal rendering/synthesis

The following description applies to one frame of the audio signal; T denotes the transpose operator.

Using the decoded transport channels x = x(k, n) = [X1(k, n), X2(k, n)]^T, i.e., the audio signal in a time-frequency representation (comprising two transport channels in this example), and the parametric side information, a mixing matrix M is derived for each subframe (or for each frame, to reduce computational complexity) in order to synthesize a time-frequency output signal y = y(k, n) = [Y1(k, n), Y2(k, n), Y3(k, n), ...]^T comprising multiple output channels (e.g., 5.1, 7.1, 7.1+4, etc.):

For all (input) objects, so-called direct response values are determined using the transmitted object directions; these describe the panning gains to be used for the output channels. The direct response values are specific to the target layout, i.e., the number and positions of the loudspeakers (provided as part of the output configuration). Examples of panning methods include vector-base amplitude panning (VBAP) [Pulkki1997] and edge-fading amplitude panning (EFAP) [Borß2014]. Each object has associated with it a vector dri of direct response values (containing as many elements as there are loudspeakers). These vectors are calculated once per frame. Note: if an object position coincides with a loudspeaker position, the vector contains the value 1 for that loudspeaker and 0 for all others. If the object is located between two (or three) loudspeakers, the corresponding number of non-zero vector elements is two (or three).
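As a simplified, horizontal-plane-only sketch of such a panning step (a stand-in for the VBAP/EFAP methods cited above, not the actual implementation), the direct response vector of one object over a loudspeaker ring can be computed as follows:

```python
import numpy as np


def unit(azimuth_deg):
    """2-D unit vector for an azimuth; 0 deg = front, positive = left."""
    a = np.radians(azimuth_deg)
    return np.array([np.cos(a), np.sin(a)])


def direct_response(obj_azi, spk_azis):
    """Direct response vector dr_i for one object over all loudspeakers,
    using simple 2-D pairwise amplitude panning between the adjacent
    loudspeaker pair enclosing the object direction."""
    dr = np.zeros(len(spk_azis))
    order = np.argsort(spk_azis)
    for idx in range(len(order)):
        i, j = order[idx], order[(idx + 1) % len(order)]
        base = np.column_stack([unit(spk_azis[i]), unit(spk_azis[j])])
        g = np.linalg.solve(base, unit(obj_azi))
        if g.min() >= -1e-9:          # object lies between this pair
            g = np.clip(g, 0.0, None)
            g /= np.linalg.norm(g)    # power normalization
            dr[i], dr[j] = g[0], g[1]
            return dr
    raise ValueError("no enclosing loudspeaker pair found")
```

As the note above states, an object coinciding with a loudspeaker yields a 1 for that loudspeaker and 0 elsewhere, and an object between two loudspeakers yields exactly two non-zero elements.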

The actual synthesis step (in this embodiment, covariance synthesis [Vilkamo2013]) consists of the following sub-steps (see Fig. 5 for a visualization):

o For each parameter band, the object indices describing the subset of dominant objects among the input objects within the time/frequency tiles grouped into this parameter band are used to extract the subset of vectors dri needed for further processing. Since only, e.g., 2 relevant objects are considered, the 2 vectors dri associated with these 2 relevant objects are needed.

o From the direct response values dri, a covariance matrix Ci of size output channels × output channels is then calculated for each relevant object:

Ci = dri · dri^T

o For each time/frequency tile (within the parameter band), the audio signal power P(k, n) is determined. In the case of two transport channels, the signal power of the first channel is added to the signal power of the second channel. Each of the power ratio values is multiplied by this signal power, resulting in one direct power value per relevant/dominant object i:

DPi(k, n) = PRi(k, n) · P(k, n)

o For each frequency band k, the final target covariance matrix CY of size output channels × output channels is obtained by summing over all time slots n within the (sub)frame and over all relevant objects:

CY(k) = Σn Σi DPi(k, n) · Ci
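The target covariance computation described in the steps above can be sketched as follows for one frequency band (array shapes are chosen for the example only):

```python
import numpy as np


def target_covariance(dr, pr, p):
    """Target covariance C_Y for one frequency band k from direct responses,
    power ratios and tile powers.
    dr: (num_rel, out_ch)  direct response vectors dr_i
    pr: (num_rel, slots)   power ratios PR_i(k, n) within this band
    p:  (slots,)           summed transport-channel power P(k, n)
    """
    out_ch = dr.shape[1]
    cy = np.zeros((out_ch, out_ch))
    for i, dri in enumerate(dr):
        ci = np.outer(dri, dri)      # C_i = dr_i dr_i^T
        e_i = np.sum(pr[i] * p)      # direct power DP_i summed over slots n
        cy += e_i * ci               # sum over relevant objects i
    return cy
```

With two objects panned exactly onto two distinct loudspeakers, C_Y is diagonal with each object's accumulated direct power on the diagonal.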

Fig. 5 illustrates a detailed overview of the covariance synthesis step performed in block 706 of Fig. 4. Specifically, the embodiment of Fig. 5 comprises a signal power calculation block 721, a direct power calculation block 722, a covariance matrix calculation block 723, a target covariance matrix calculation block 724, an input covariance matrix calculation block 726, a mixing matrix calculation block 725, and a rendering block 727, which, for Fig. 5, additionally comprises the filter bank block 708 of Fig. 4, so that the output signal of block 727 preferably corresponds to the time-domain output signal. When block 708 is not included in the rendering block of Fig. 5, however, the result is a spectral-domain representation of the corresponding audio channels.

(The following steps are part of the state of the art [Vilkamo2013] and are included for clarity.)

o For each (sub)frame and each frequency band, an input covariance matrix Cx = x·x^T of size transport channels × transport channels is calculated from the decoded audio signal. Optionally, only the entries of the main diagonal may be used, in which case the other non-zero entries are set to zero.

o A prototype matrix of size output channels × transport channels is defined, which describes the mapping of the transport channels to the output channels, the number of which is given by the target output format (e.g., the target loudspeaker layout, provided as part of the output configuration). This prototype matrix may be static or may vary from frame to frame. Example: if only a single transport channel is transmitted, this transport channel is mapped to every output channel. If two transport channels are transmitted, the left (first) channel is mapped to all output channels located at positions within (+0°, +180°), i.e., the "left" channels. The right (second) channel is correspondingly mapped to all output channels located at positions within (-0°, -180°), i.e., the "right" channels. (Note: 0° describes the position in front of the listener, positive angles describe positions to the left of the listener, and negative angles describe positions to the right of the listener. If a different convention is employed, the signs of the angles need to be adjusted accordingly.)
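The prototype-matrix mapping rule just described can be sketched as follows (the handling of a loudspeaker at exactly 0°, assigned here to the left channel, is an assumption, since the text leaves that boundary case open):

```python
import numpy as np


def prototype_matrix(spk_azis, num_transport):
    """Prototype matrix Q (output channels x transport channels) following
    the mapping rule above for one or two transport channels."""
    n = len(spk_azis)
    if num_transport == 1:
        return np.ones((n, 1))   # the single channel feeds every output
    q = np.zeros((n, 2))
    for row, azi in enumerate(spk_azis):
        # positive azimuths -> left (first) channel, negative -> right
        q[row, 0 if azi >= 0 else 1] = 1.0
    return q
```

For a 5-speaker ring the matrix thus has exactly one 1 per row, selecting the transport channel on the speaker's side.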

o Using the input covariance matrix Cx, the target covariance matrix CY, and the prototype matrix, a mixing matrix is calculated for each (sub)frame and each frequency band [Vilkamo2013], resulting in, e.g., 60 mixing matrices per (sub)frame.

o The mixing matrices are (e.g., linearly) interpolated between (sub)frames, corresponding to a temporal smoothing.

o Finally, the output channels y are synthesized band by band from the corresponding band of the time/frequency representation of the decoded transport channels x by multiplying with the final set of mixing matrices M, each of size output channels × transport channels:

y = Mx
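The interpolation and band-wise application of the mixing matrices can be sketched together as follows (the per-slot linear weighting used here is one plausible realization of the "e.g., linear" interpolation mentioned above, not the mandated one):

```python
import numpy as np


def render_frame(m_prev, m_curr, x):
    """Apply band-wise mixing matrices with linear interpolation over the
    slots of one (sub)frame.
    m_prev, m_curr: (bands, out_ch, in_ch) mixing matrices of the previous
                    and the current (sub)frame
    x:              (bands, slots, in_ch) transport channels in the T/F domain
    Returns (bands, slots, out_ch) output channels y = M x, band by band."""
    bands, slots, _ = x.shape
    y = np.empty((bands, slots, m_curr.shape[1]))
    for n in range(slots):
        w = (n + 1) / slots                   # interpolation weight
        m = (1.0 - w) * m_prev + w * m_curr   # smoothed per-band matrices
        for k in range(bands):
            y[k, n] = m[k] @ x[k, n]          # y = M x in band k, slot n
    return y
```

With identical matrices for both (sub)frames the interpolation is a no-op and the output is simply M applied per band.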

Note that we do not make use of the residual signal r as described in [Vilkamo2013].

· Using a filter bank, the output signal y is converted back into a time-domain representation y(t).

Optimized covariance synthesis

Due to the way the input covariance matrix Cx and the target covariance matrix CY are calculated in this embodiment, certain optimizations of the optimal mixing matrix calculation using covariance synthesis from [Vilkamo2013] can be achieved, resulting in a significantly reduced computational complexity of the mixing matrix calculation. Note that, in this section, the Hadamard operator ∘ denotes an element-wise operation on matrices, i.e., instead of following the rules of, e.g., matrix multiplication, the corresponding operation is carried out element by element. The multiplication of matrices A and B would, for example, not correspond to the matrix product AB = C, but to the element-wise operation a_ij · b_ij = c_ij.

SVD(.) denotes the singular value decomposition. The algorithm from [Vilkamo2013], presented there as a Matlab function (Listing 1), is as follows (state of the art):

Input: matrix Cx of size m × m, containing the covariance of the input signal

Input: matrix CY of size n × n, containing the target covariance of the output signal

Input: matrix Q of size n × m, the prototype matrix

Input: scalar α, regularization factor for Sx ([Vilkamo2013] suggests α = 0.2)

Input: scalar β, regularization factor ([Vilkamo2013] suggests β = 0.001)

Input: Boolean a, denoting whether energy compensation should be performed instead of calculating the residual covariance Cr

Output: matrix M of size n × m, the optimal mixing matrix

Output: matrix Cr of size n × n, containing the residual covariance

% Decomposition of CY ([Vilkamo2013], equation (3))

As described in the previous section, optionally only the main diagonal elements of Cx are used and all other entries are set to zero. In this case, Cx is a diagonal matrix, and a valid decomposition satisfying equation (3) of [Vilkamo2013] is

Kx = Cx^(1/2),

and the SVD in line 3 of the state-of-the-art algorithm is no longer needed.

Consider the formulas for generating the target covariance from the direct responses dri and the direct powers of the previous section:

Ci = dri · dri^T

DPi(k, n) = PRi(k, n) · P(k, n)

The last formula for CY can be rearranged and written as

CY(k) = Σi ( Σn DPi(k, n) ) · Ci.

If we now define

Ei(k) = Σn DPi(k, n),

we obtain

CY(k) = Σi Ei(k) · Ci.

It is easily seen that, if we arrange the direct responses of the k most dominant objects in a direct response matrix R = [dr1 … drk] and create the diagonal matrix E of direct powers, where e_{i,i} = Ei, CY can also be expressed as

CY = R·E·R^H

and a valid decomposition of CY satisfying equation (3) of [Vilkamo2013] is given by

KY = R·E^(1/2).

Hence, the SVD in line 1 of the state-of-the-art algorithm is no longer needed.
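For real-valued matrices (where the Hermitian transpose reduces to the transpose), the two shortcut decompositions can be sketched and checked numerically as follows:

```python
import numpy as np


def decompose_cx(cx_diag):
    """K_x for a diagonal C_x: the element-wise square root satisfies
    C_x = K_x K_x^H without any SVD."""
    return np.sqrt(cx_diag)


def decompose_cy(r, e_diag):
    """K_y for C_Y = R E R^H: K_y = R E^(1/2) satisfies
    C_Y = K_y K_y^H without an SVD of C_Y."""
    return r @ np.sqrt(e_diag)
```

Both factors reproduce their covariance exactly, which is all that equation (3) of [Vilkamo2013] requires of a decomposition.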

This leads to the optimized algorithm for covariance synthesis within this embodiment, which additionally takes into account that we always use the energy compensation option and therefore do not need the residual target covariance Cr:

Input: diagonal matrix Cx of size m × m, containing the covariance of the input signal with m channels

Input: matrix R of size n × k, containing the direct responses of the k dominant objects

Input: diagonal matrix E, containing the target powers of the dominant objects

Input: matrix Q of size n × m, the prototype matrix

Input: scalar α, regularization factor for Sx ([Vilkamo2013] suggests α = 0.2)

Input: scalar β, regularization factor ([Vilkamo2013] suggests β = 0.001)

Output: matrix M of size n × m, the optimal mixing matrix

% Decomposition of CY (inventive step)

A careful comparison of the state-of-the-art algorithm and the proposed algorithm shows that the former requires three SVDs of matrices of sizes m × m, n × n, and m × n, respectively, where m is the number of downmix channels and n is the number of output channels the objects are rendered to.

The proposed algorithm requires only a single SVD of a matrix of size m × k, where k is the number of dominant objects. Moreover, since k is typically much smaller than n, this matrix is smaller than the corresponding matrices of the state-of-the-art algorithm.

For an m × n matrix, the complexity of a standard SVD implementation is roughly O(c1·m²·n + c2·n³) [Golub2013], where c1 and c2 are constants depending on the algorithm used. Hence, the computational complexity of the proposed algorithm is significantly reduced compared to the state-of-the-art algorithm.

Subsequently, preferred embodiments related to the encoder side of the first aspect are discussed with respect to Figs. 7a and 7b. Furthermore, preferred implementations of the encoder side of the second aspect are discussed with respect to Figs. 9a to 9d.

Fig. 7a illustrates a preferred implementation of the object parameter calculator 100 of Fig. 1a. In block 120, the audio objects are converted into a spectral representation. This is implemented by the filter bank 102 of Fig. 2 or Fig. 3. Then, in block 122, the selection information is calculated, for example, as illustrated by block 104 of Fig. 2 or Fig. 3. To this end, an amplitude-related measure can be used, such as the amplitude itself, the power, the energy, or any other measure obtained by raising the amplitude to a power different from one. The result of block 122 is a set of selection information for each object in the corresponding time/frequency bin. Then, in block 124, the object IDs per time/frequency bin are derived. In accordance with the first aspect, two or more object IDs per time/frequency bin are derived. In accordance with the second aspect, the number of object IDs per time/frequency bin can even be only a single object ID, so that block 124 identifies the most important, strongest, or most relevant object from the information provided by block 122. Block 124 outputs information on the parametric data and comprises the single or several indices of the most relevant object or objects.

In the case of two or more relevant objects per time/frequency bin, the functionality of block 126 is useful for calculating amplitude-related measures characterizing the objects in the time/frequency bin. This amplitude-related measure can be the same as the one already calculated for the selection information in block 122, or, preferably, combined values are calculated using the information already computed by block 122, as indicated by the broken line between block 122 and block 126. The amplitude-related measures or the one or more combined values are then calculated in block 126 and forwarded to the quantizer and encoder block 212, in order to have encoded amplitude-related values or encoded combined values in the side information as additional parametric side information. In the embodiments of Fig. 2 or Fig. 3, these values are the "encoded power ratios" included in the bitstream together with the "encoded object indices". In the case of only a single object ID per frequency bin, the power ratio calculation and the quantization and encoding are not necessary, and the index of the most relevant object in the time/frequency bin is sufficient for performing the decoder-side rendering.

Fig. 7b illustrates a preferred implementation of the calculation of the selection information of Fig. 7a. As illustrated by block 123, the signal power per object and per time/frequency bin is calculated as the selection information. Then, in block 125, which illustrates a preferred implementation of block 124 of Fig. 7a, the object IDs of the single or, preferably, the two or more objects with the highest power are extracted and output. Furthermore, in the case of two or more relevant objects, the calculation of the power ratios is illustrated by block 127 as a preferred implementation of block 126, where the power ratios are calculated for the extracted object IDs in relation to the powers of all extracted objects having the corresponding object IDs found by block 125. This procedure is advantageous, since only a number of combined values that is one less than the number of objects of the time/frequency bin has to be transmitted, because in this embodiment there is a rule known to the decoder stating that the power ratios of all objects have to sum to 1. Preferably, the functionalities of blocks 120, 122, 124, 126 of Fig. 7a and/or blocks 123, 125, 127 of Fig. 7b are implemented by the object parameter calculator 100 of Fig. 1a, and the functionality of block 212 of Fig. 7a is implemented by the output interface 200 of Fig. 1a.
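The selection and power-ratio steps of blocks 123, 125 and 127 can be sketched for one time/frequency bin as follows:

```python
import numpy as np


def select_relevant_objects(powers, num_relevant=2):
    """Pick the object IDs with the highest signal power in one bin
    (blocks 123/125) and the power ratios relating them (block 127).
    powers: (num_objects,) signal powers of all objects in the bin.
    Returns the selected IDs and the first num_relevant - 1 ratios; the
    last ratio is implicit, since all ratios sum to 1."""
    ids = np.argsort(powers)[::-1][:num_relevant]   # strongest objects first
    sel = powers[ids]
    ratios = sel / sel.sum()                        # sum to 1 by construction
    return ids.tolist(), ratios[:-1].tolist()       # transmit n - 1 ratios
```

The decoder recovers the omitted ratio as 1 minus the transmitted ones, matching the sum-to-1 rule stated above.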

Subsequently, the apparatus for encoding in accordance with the second aspect illustrated in Fig. 1b is described in more detail with respect to several embodiments. In step 110a, the direction information is either extracted from the input signal, for example as illustrated in Fig. 12a, or is extracted by reading or parsing metadata information included in a metadata portion or a metadata file. In step 110b, the direction information per frame and audio object is quantized, and the quantization indices per frame and per object are forwarded to the encoder or output interface, such as the output interface 200 of Fig. 1b. In step 110c, the direction quantization indices are dequantized in order to have dequantized values, which, in certain implementations, may also be output directly by block 110b. Then, based on the dequantized direction indices, block 422 calculates weights per transport channel and per object based on a certain virtual microphone setup. This virtual microphone setup may comprise two virtual microphone signals arranged at the same position with different orientations, or may be a setup in which there are two different positions relative to a reference position or orientation, such as a virtual listener position or orientation. A setup with two virtual microphone signals results in weights for two transport channels per object.

In case three transport channels are generated, the virtual microphone setup can be considered to comprise three virtual microphone signals from microphones arranged at the same position with different orientations, or arranged at three different positions relative to a reference position or orientation, where the reference position or orientation may be a virtual listener position or orientation.

Alternatively, four transport channels can be generated based on a virtual microphone setup that generates four virtual microphone signals from microphones arranged at the same position with different orientations, or from microphones arranged at four different positions relative to a reference position or reference orientation, where the reference position or orientation may be a virtual listener position or a virtual listener orientation.

Furthermore, for calculating the weights wL and wR per object and per transport channel (taking two channels as an example), the virtual microphone signals are signals derived from virtual first-order microphones, or virtual cardioid microphones, or virtual figure-of-eight, dipole, or bidirectional microphones, or signals derived from virtual directional microphones, virtual subcardioid microphones, virtual unidirectional microphones, virtual supercardioid microphones, or virtual omnidirectional microphones.

In this context, it is to be noted that, for calculating the weights, no placement of any real microphones is required. Rather, the rules for calculating the weights change depending on the virtual microphone setup, i.e., the placement of the virtual microphones and the characteristics of the virtual microphones.
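As one illustrative realization of such a rule (the cardioid characteristic and the ±90° aiming directions are assumptions for this example; the passage deliberately leaves the microphone type and placement open), the weights wL and wR for one object can be computed as:

```python
import numpy as np


def cardioid_weights(azimuth_deg):
    """Example weights w_L, w_R for one object, derived from two virtual
    cardioid microphones at the same position, aimed at +90 deg (left)
    and -90 deg (right). 0 deg = in front of the listener, positive
    angles = left, as in the convention used earlier in this description."""
    a = np.radians(azimuth_deg)
    w_l = 0.5 + 0.5 * np.cos(a - np.pi / 2)   # cardioid aimed to the left
    w_r = 0.5 + 0.5 * np.cos(a + np.pi / 2)   # cardioid aimed to the right
    return w_l, w_r
```

An object hard left (+90°) then contributes only to the left transport channel, while a frontal object (0°) contributes equally to both.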

In block 404 of Fig. 9a, the weights are applied to the objects so that, for each object with a weight different from 0, a contribution of the object to the particular transport channel is obtained. Hence, block 404 receives the object signals as input. Then, in block 406, the contributions per transport channel are summed, so that, for example, the object contributions to the first transport channel are added together, the object contributions to the second transport channel are added together, and so on. As indicated by block 406, the output of block 406 is then the transport channels, for example, in the time domain.

优选地,输入到块404中的对象信号是具有全带信息的时域对象信号,并且块404中的应用和块406中的求和是在时域中执行的。然而,在其他实施例中,这些步骤也可以在谱域中执行。Preferably, the object signal input to block 404 is a time domain object signal with full band information, and the applying in block 404 and the summing in block 406 are performed in the time domain. However, in other embodiments, these steps may also be performed in the spectral domain.
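
The sample-by-sample weighting and summation of blocks 404 and 406 can be sketched as follows (a minimal illustration; the function name and data layout are hypothetical):

```python
def downmix(object_signals, weights):
    """Blocks 404/406 as a sketch: object_signals is a list of equally long
    sample lists (one per audio object), weights[obj][ch] is the weight of
    object obj for transmission channel ch. Returns one sample list per
    transmission channel, formed as a weighted sum in the time domain."""
    num_channels = len(weights[0])
    num_samples = len(object_signals[0])
    channels = [[0.0] * num_samples for _ in range(num_channels)]
    for signal, obj_weights in zip(object_signals, weights):
        for ch in range(num_channels):
            w = obj_weights[ch]
            if w != 0.0:  # only objects with a non-zero weight contribute
                for n in range(num_samples):
                    channels[ch][n] += w * signal[n]
    return channels
```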

图9b示出了实现静态下混的另一实施例。为此，在块130中提取第一帧的方向信息，并且取决于第一帧来计算权重，如块403a所示。然后，对于块408中所指示的其他帧，权重保持原样以实现静态下混。Fig. 9b shows another embodiment implementing a static downmix. To this end, the direction information of the first frame is extracted in block 130, and the weights are calculated depending on this first frame, as shown in block 403a. Then, for the other frames, as indicated in block 408, the weights are left as they are in order to obtain a static downmix.

图9c示出了计算动态下混的另一种实现。为此，块132提取每个帧的方向信息，并且针对每个帧更新权重，如块403b所示。然后，在块405中，将经更新的权重应用于帧，以实现从帧到帧变化的动态下混。图9b和图9c的那些极端情况之间的其他实现也是有用的，其中，例如，仅针对每第二、三或每第n帧更新权重，和/或执行随时间的权重平滑，使得出于根据方向信息进行下混的目的，天线特性不会时不时地改变太多。图9d示出了由图1b的对象方向信息提供器110控制的下混器400的另一实现。在块410中，下混器被配置为分析帧中所有对象的方向信息，并且在块412中，出于计算立体声示例的权重wL和wR的目的，麦克风被放置为与分析结果一致，其中麦克风的放置是指麦克风位置和/或麦克风方向。在块414中，类似于关于图9b的块408所讨论的静态下混，将麦克风留给其他帧，或者根据关于图9c的块405所讨论的内容来更新麦克风，以便获得图9d的块414的功能。关于块412的功能，麦克风可以被放置为使得获得良好的分离，从而使得第一虚拟麦克风"看"向第一组对象并且第二虚拟麦克风"看"向第二组对象，第二组对象与第一组对象不同，并且优选地，不同之处在于，一个组中的任何对象尽可能地不被包括在另一组中。备选地，块410的分析可以通过其他参数来增强，并且放置也可以通过其他参数来控制。Fig. 9c shows another implementation that calculates a dynamic downmix. To this end, block 132 extracts the direction information of each frame, and the weights are updated for each frame, as shown in block 403b. Then, in block 405, the updated weights are applied to the frame in order to obtain a dynamic downmix that changes from frame to frame. Other implementations between the extreme cases of Fig. 9b and Fig. 9c are also useful, in which, for example, the weights are updated only for every second, every third or every n-th frame, and/or a smoothing of the weights over time is performed so that, for the purpose of downmixing according to the direction information, the antenna characteristic does not change too much from time to time. Fig. 9d shows another implementation of the downmixer 400 controlled by the object direction information provider 110 of Fig. 1b. In block 410, the downmixer is configured to analyze the direction information of all objects in a frame, and in block 412, for the purpose of calculating the weights wL and wR of the stereo example, the microphones are placed in line with the analysis result, where the placement of a microphone refers to the microphone position and/or the microphone orientation. In block 414, the microphones are either left as they are for the other frames, similar to the static downmix discussed with respect to block 408 of Fig. 9b, or updated as discussed with respect to block 405 of Fig. 9c, in order to obtain the functionality of block 414 of Fig. 9d. With respect to the functionality of block 412, the microphones may be placed so that a good separation is obtained, so that the first virtual microphone "looks" at a first group of objects and the second virtual microphone "looks" at a second group of objects, the second group being different from the first group and preferably differing in that any object of one group is, as far as possible, not included in the other group. Alternatively, the analysis of block 410 may be enhanced by other parameters, and the placement may also be controlled by other parameters.
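
The intermediate variants between the static and the fully dynamic case (updating the weights only every n-th frame, smoothing over time) can be sketched as follows; the update interval and the smoothing factor alpha are illustrative assumptions:

```python
def weight_trajectory(frame_infos, compute_weight, n=4, alpha=0.8):
    """Between Fig. 9b (static) and Fig. 9c (dynamic): recompute the weight
    only for every n-th frame and smooth it over time so the downmix
    characteristic does not change too abruptly. compute_weight maps one
    frame's direction information to a weight value."""
    out = []
    smoothed = None
    for i, info in enumerate(frame_infos):
        if i % n == 0:  # update only every n-th frame
            target = compute_weight(info)
        if smoothed is None:
            smoothed = target
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * target
        out.append(smoothed)
    return out
```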

随后,根据第一方面或第二方面并关于例如图6a和图6b所讨论的解码器的优选实现由下图10a、图10b、图10c、图10d和图11给出。Subsequently, preferred implementations of the decoder according to the first aspect or the second aspect and discussed with respect to, for example, Figs. 6a and 6b are given by the following Figs. 10a, 10b, 10c, 10d and 11.

在块613中,输入接口600被配置为获取与对象ID相关联的单独对象方向信息。该过程对应于图4或图5的块612的功能,并且导致如关于图8b并且特别是8c所示出和讨论的“帧的码本”。In block 613, input interface 600 is configured to obtain individual object direction information associated with the object ID. This process corresponds to the functionality of block 612 of Figure 4 or Figure 5, and results in a "codebook of frames" as shown and discussed with respect to Figures 8b and particularly 8c.
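
The "codebook of the frame" can be illustrated as a plain mapping from object ID to the direction transmitted once per frame, which block 611 then consults for the relevant object IDs of each time/frequency bin (names and the direction representation are hypothetical):

```python
def build_frame_codebook(object_ids, directions):
    """Block 613 as a sketch: one direction entry (e.g. an
    (azimuth, elevation) pair) per object ID for the current frame."""
    return dict(zip(object_ids, directions))

def directions_for_bin(codebook, relevant_ids):
    """Block 611 as a sketch: look up the transmitted directions of the
    one or more relevant objects of a time/frequency bin."""
    return [codebook[obj_id] for obj_id in relevant_ids]
```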

此外，在块609中，获取每个时间/频率区间的一个或多个对象ID，而不管这些数据对于低分辨率参数带或高分辨率频率区是否可用。对应于图4中的块608的过程的块609的结果是在时间/频率区间中的一个或多个相关对象的特定ID。然后，在块611中，从"帧的码本"(即，从图8c所示的示例性表)中获取每个时间/频率区间的特定一个或多个ID的特定对象方向信息。然后，在块704中，针对每个时间/频率区间，计算由输出格式控制的各个输出通道的一个或多个相关对象的增益值。然后，在块730或706、708中，计算输出通道。对输出通道的计算的功能可以如图10b所示在对一个或多个传输通道的贡献的显式计算中进行，或者可以如图10d或图11所示通过对传输通道贡献的间接计算和使用来进行。图10b示出了在与图4的功能相对应的块610中获取功率值或功率比的功能。然后，将这些功率值应用于每个相关对象的各个传输通道，如块733和735所示。Furthermore, in block 609, the one or more object IDs of each time/frequency bin are obtained, irrespective of whether these data are available for low-resolution parameter bands or for high-resolution frequency tiles. The result of block 609, which corresponds to the procedure of block 608 in Fig. 4, is the specific ID or IDs of the one or more relevant objects in a time/frequency bin. Then, in block 611, the specific object direction information for the specific one or more IDs of each time/frequency bin is obtained from the "codebook of the frame" (i.e., from the exemplary table shown in Fig. 8c). Then, in block 704, the gain values of the one or more relevant objects are calculated for each time/frequency bin and for the individual output channels controlled by the output format. The output channels are then calculated in block 730 or in blocks 706, 708. The calculation of the output channels can be performed with an explicit calculation of the contributions to the one or more transmission channels, as shown in Fig. 10b, or with an indirect calculation and use of the transmission channel contributions, as shown in Fig. 10d or Fig. 11. Fig. 10b shows the functionality of obtaining the power values or power ratios in block 610, which corresponds to the functionality of Fig. 4. These power values are then applied to the individual transmission channels for each relevant object, as shown in blocks 733 and 735.
Furthermore, these power values are applied to the respective transmission channels in addition to the gain values determined by block 704, so that blocks 733, 735 produce object-specific contributions of the transmission channels (e.g., transmission channels ch1, ch2, ...). These explicitly calculated transmission-channel contributions are then added together in block 737 for each output channel of each time/frequency bin.
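
Blocks 733, 735 and 737 can be sketched per time/frequency bin as follows; applying the transmitted ratio as an amplitude factor sqrt(ratio), and using a single transmission-channel tile per object, are assumptions of this sketch:

```python
import math

def output_channel_bin(tc_samples, relevant_ids, gains, power_ratios):
    """One time/frequency bin of one output channel: each relevant object's
    contribution is the transmission-channel tile scaled by the object's
    output-channel gain (block 704) and by the square root of its power
    ratio (blocks 733/735); block 737 sums the contributions."""
    out = [0.0] * len(tc_samples)
    for obj in relevant_ids:
        amplitude = gains[obj] * math.sqrt(power_ratios[obj])
        for n, sample in enumerate(tc_samples):
            out[n] += amplitude * sample
    return out
```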

然后，取决于实现，可以提供扩散信号计算器741，该扩散信号计算器741在对应时间/频率区间中生成针对每个输出通道ch1、ch2……的扩散信号，并且扩散信号和块737的贡献结果被组合，使得获得每个时间/频率区间中的完整通道贡献。当协方差合成附加地依赖于扩散信号时，该信号对应于图4的滤波器组708的输入。然而，当协方差合成706不依赖于扩散信号而仅依赖于没有任何解相关器的处理时，则至少每个时间/频率区间的输出信号的能量对应于在图10b的块739的输出处的通道贡献的能量。此外，在不使用扩散信号计算器741的情况下，块739的结果对应于块706的结果，即具有每个时间/频率区间的可以针对每个输出通道ch1、ch2单独转换的完整通道贡献，以便最终获得具有时域输出通道的输出音频文件，该输出音频文件可以被存储或转发给扬声器或任何类型的渲染设备。Then, depending on the implementation, a diffuse signal calculator 741 may be provided, which generates a diffuse signal for each output channel ch1, ch2, ... in the corresponding time/frequency bin, and the diffuse signal and the contribution result of block 737 are combined so that the complete channel contribution in each time/frequency bin is obtained. When the covariance synthesis additionally relies on a diffuse signal, this signal corresponds to the input of the filter bank 708 of Fig. 4. When, however, the covariance synthesis 706 does not rely on a diffuse signal but only on a processing without any decorrelator, then at least the energy of the output signal of each time/frequency bin corresponds to the energy of the channel contributions at the output of block 739 of Fig. 10b. Furthermore, when the diffuse signal calculator 741 is not used, the result of block 739 corresponds to the result of block 706, i.e., the complete channel contributions of each time/frequency bin, which can be converted individually for each output channel ch1, ch2 in order to finally obtain an output audio file with time-domain output channels that can be stored or forwarded to loudspeakers or any kind of rendering device.

图10c示出了图10b或图4的块610的功能的优选实现。在步骤610a中,针对某个时间/频率区间获取组合(功率)值或若干个值。在块610b中,基于所有组合值必须总和为l的计算规则,计算时间/频率区间中的其他相关对象的对应其他值。Figure 10c shows a preferred implementation of the functionality of the block 610 of Figure 10b or Figure 4. In step 610a, a combined (power) value or values are obtained for a certain time/frequency interval. In block 610b, corresponding other values of other related objects in the time/frequency interval are calculated based on the calculation rule that all combined values must sum to 1.

然后，结果将优选地是低分辨率表示，其中，针对每个分组时隙索引和每个参数带索引，低分辨率表示具有两个功率比。这些功率比表示低时间/频率分辨率。在块610c中，时间/频率分辨率可以扩展到高时间/频率分辨率，使得其具有高分辨率时隙索引n和高分辨率频带索引k的时间/频率区的功率值。该扩展可以包括对分组时隙内的对应时隙和参数带内的对应频带直接使用一个且相同的低分辨率索引。The result will then preferably be a low-resolution representation having two power ratios for each grouped slot index and each parameter band index. These power ratios represent a low time/frequency resolution. In block 610c, the time/frequency resolution can be expanded to a high time/frequency resolution, so that power values are available for the time/frequency tiles with a high-resolution slot index n and a high-resolution band index k. The expansion can consist of directly using one and the same low-resolution index for the corresponding slots within a grouped slot and the corresponding bands within a parameter band.
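
Blocks 610a to 610c can be sketched as follows: with two relevant objects only one ratio has to be transmitted, since all ratios of a bin sum to 1, and the low-resolution value is simply reused for every high-resolution slot and band it covers (the uniform grouping factors are assumptions of this sketch):

```python
def complete_ratios(transmitted_ratios):
    """Block 610b: recover the missing ratio from the rule that all
    ratios of a time/frequency bin must sum to 1."""
    return transmitted_ratios + [1.0 - sum(transmitted_ratios)]

def expand_low_res(low_res, slots_per_group, bands_per_param_band):
    """Block 610c: expand a value given per (grouped slot m, parameter
    band pb) to the high-resolution grid (slot n, band k) by reusing the
    same low-resolution index for all covered slots and bands."""
    high = {}
    for (m, pb), value in low_res.items():
        for n in range(m * slots_per_group, (m + 1) * slots_per_group):
            for k in range(pb * bands_per_param_band,
                           (pb + 1) * bands_per_param_band):
                high[(n, k)] = value
    return high
```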

图10d示出了用于计算图4的块706中的协方差合成信息的功能的优选实现,由用于将两个或更多个输入传输通道混合成两个或更多个输出信号的混合矩阵725来表示。因此,当具有例如两个传输通道和六个输出通道时,每个单独的时间/频率区间的混合矩阵的大小将是六行和两列。在与图5中的块723的功能相对应的块723中,接收每个时间/频率区间中每个对象的增益值或直接响应值,并且计算协方差矩阵。在块722中,接收功率值或功率比,并计算时间/频率区间中每个对象的直接功率值,并且图10d中的块722对应于图5的块722。Figure 10d shows a preferred implementation of the function for calculating the covariance synthesis information in the block 706 of Figure 4, represented by a mixing matrix 725 for mixing two or more input transmission channels into two or more output signals. Therefore, when there are, for example, two transmission channels and six output channels, the size of the mixing matrix of each individual time/frequency interval will be six rows and two columns. In a block 723 corresponding to the function of the block 723 in Figure 5, the gain value or direct response value of each object in each time/frequency interval is received, and the covariance matrix is calculated. In block 722, power values or power ratios are received, and the direct power value of each object in the time/frequency interval is calculated, and the block 722 in Figure 10d corresponds to the block 722 of Figure 5.

将块723和722两者的结果都输入到目标协方差矩阵计算器724中。附加地或备选地，目标协方差矩阵Cy的显式计算不是必需的。相反，将包括在目标协方差矩阵中的相关信息(即，针对两个或更多个相关对象，在矩阵R中指示的直接响应值信息和在矩阵E中指示的直接功率值)输入到用于计算每个时间/频率区间的混合矩阵的块725a中。此外，混合矩阵块725a接收关于从与图5的块726相对应的块726中所示的两个或更多个传输通道导出的输入协方差矩阵Cx以及原型矩阵Q的信息。可以对每个时间/频率区间和帧的混合矩阵进行时间平滑，如块725b所示，并且在与图5的渲染块的至少一部分相对应的块727中，将混合矩阵以非平滑或平滑形式应用于对应时间/频率区间中的传输通道，以获得时间/频率区间中的完整通道贡献，基本类似于之前关于图10b在块739的输出处所讨论的对应完整贡献。因此，图10b示出了传输通道贡献的显式计算的实现，而图10d示出了经由目标协方差矩阵Cy或经由块723和722的直接引入到混合矩阵计算块725a中的相关信息R和E来隐式地计算每个时间/频率区间和每个时间/频率区间中的每个相关对象的传输通道贡献的过程。The results of both blocks 723 and 722 are input into the target covariance matrix calculator 724. Additionally or alternatively, an explicit calculation of the target covariance matrix Cy is not necessary. Instead, the relevant information included in the target covariance matrix (i.e., for the two or more relevant objects, the direct-response value information indicated in the matrix R and the direct power values indicated in the matrix E) is input into block 725a for calculating the mixing matrix of each time/frequency bin. Furthermore, mixing matrix block 725a receives information on the input covariance matrix Cx derived from the two or more transmission channels, as shown in block 726 corresponding to block 726 of Fig. 5, and on the prototype matrix Q. The mixing matrix for each time/frequency bin and frame may be smoothed over time, as shown in block 725b, and in block 727, which corresponds to at least a part of the rendering block of Fig. 5, the mixing matrix is applied in its non-smoothed or smoothed form to the transmission channels of the corresponding time/frequency bin in order to obtain the complete channel contributions in the time/frequency bin, substantially similar to the corresponding complete contributions discussed before with respect to the output of block 739 of Fig. 10b. Thus, Fig. 10b illustrates an implementation with an explicit calculation of the transmission channel contributions, while Fig. 10d illustrates a procedure in which the transmission channel contributions of each relevant object in each time/frequency bin are calculated implicitly, via the target covariance matrix Cy or via the relevant information R and E of blocks 723 and 722 introduced directly into the mixing matrix calculation block 725a.
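
The quantities entering the mixing-matrix computation can be sketched with NumPy as follows (an illustration following the symbols of the text: R holds the direct-response values of the relevant objects per output channel, E is the diagonal matrix of direct powers, and only the diagonal of the input covariance matrix Cx is used):

```python
import numpy as np

def target_covariance(R, direct_powers):
    """Blocks 723/722/724 as a sketch: Cy = R E R^H with
    E = diag(direct_powers), R of shape
    (num_output_channels, num_relevant_objects)."""
    E = np.diag(direct_powers)
    return R @ E @ R.conj().T

def input_covariance_diagonal(transmission_tile):
    """Only the diagonal elements of Cx are used: the per-channel powers
    of the transmission-channel samples of this time/frequency bin."""
    Cx = transmission_tile @ transmission_tile.conj().T
    return np.diag(np.diag(Cx))
```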

随后，关于图11示出了用于协方差合成的优选优化算法。需要概述的是，图11所示的所有步骤都是在图4的协方差合成706内或在图5的混合矩阵计算块725或图10d中的725a内计算的。在步骤751中，计算第一分解结果Ky。由于如下事实：如图10d所示，直接使用包括在矩阵R中的增益值信息和来自两个或更多个相关对象的信息，具体地包括在矩阵ER中的直接功率信息，而无需显式计算协方差矩阵，因此可以容易地计算该分解结果。因此，可以直接计算块751中的第一分解结果并且无需太多工作量，因为不再需要特定奇异值分解。Subsequently, a preferred optimized algorithm for the covariance synthesis is illustrated with respect to Fig. 11. It is to be noted that all steps shown in Fig. 11 are calculated within the covariance synthesis 706 of Fig. 4, or within the mixing matrix calculation block 725 of Fig. 5 or 725a of Fig. 10d. In step 751, the first decomposition result Ky is calculated. Due to the fact that, as shown in Fig. 10d, the gain value information included in the matrix R and the information from the two or more relevant objects, specifically the direct power information included in the matrix ER, are used directly without explicitly calculating the covariance matrix, this decomposition result can easily be calculated. Therefore, the first decomposition result in block 751 can be calculated directly and without much effort, since a dedicated singular value decomposition is no longer required.

在步骤752中，将第二分解结果计算为Kx。由于输入协方差矩阵被视为忽略了非对角线元素的对角矩阵，因此也可以在没有显式奇异值分解的情况下计算该分解结果。In step 752, the second decomposition result is calculated as Kx. Since the input covariance matrix is regarded as a diagonal matrix whose off-diagonal elements are ignored, this decomposition result can also be calculated without an explicit singular value decomposition.

然后,在步骤753中,计算基于第一正则化参数α的第一正则化结果,并且在步骤754中,基于第二正则化参数β来计算第二正则化结果。由于Kx在优选实现中是对角矩阵,因此相对于现有技术简化了对第一正则化结果753的计算,因为Sx的计算仅是参数变化,而不是如现有技术中的分解。Then, in step 753, a first regularization result based on the first regularization parameter α is calculated, and in step 754, a second regularization result is calculated based on the second regularization parameter β. Since Kx is a diagonal matrix in the preferred implementation, the calculation of the first regularization result 753 is simplified relative to the prior art, because the calculation of Sx is only a parameter change, rather than a decomposition as in the prior art.

此外，对于块754中的对第二正则化结果的计算，第一步骤同样仅是参数重命名，而不是现有技术中与矩阵Ux^H S的相乘。Furthermore, for the calculation of the second regularization result in block 754, the first step is likewise only a parameter renaming instead of the multiplication with the matrix Ux^H S as in the prior art.

此外,在步骤755中,计算归一化矩阵Gy,并且基于步骤755,在步骤756中基于Kx和原型矩阵Q以及由块751获得的Ky的信息来计算酉矩阵P。由于这里不需要任何矩阵Λ,因此相对于可用的现有技术简化了对酉矩阵P的计算。Furthermore, in step 755, the normalized matrix Gy is calculated, and based on step 755, in step 756, the unitary matrix P is calculated based on Kx and the prototype matrix Q and the information of Ky obtained by block 751. Since no matrix Λ is required here, the calculation of the unitary matrix P is simplified relative to the available prior art.

然后，在步骤757中，计算没有能量补偿的混合矩阵，即Mopt，并且为此，使用酉矩阵P、块754的结果和块751的结果。然后，在块758中，使用补偿矩阵G来执行能量补偿。执行能量补偿使得不需要从解相关器导出的任何残差信号。然而，代替执行能量补偿，在该实现中将添加具有足够大的能量以填充由混合矩阵Mopt留下的能隙而没有能量信息的残差信号。然而，出于本发明的目的，不依赖解相关信号以避免由解相关器引入的任何伪音。但是，如步骤758所示的能量补偿是优选的。Then, in step 757, the mixing matrix without energy compensation, i.e., Mopt, is calculated, and for this purpose the unitary matrix P, the result of block 754 and the result of block 751 are used. Then, in block 758, an energy compensation is performed using a compensation matrix G. The energy compensation is performed so that no residual signal derived from a decorrelator is required. Instead of performing the energy compensation, however, such an implementation would add a residual signal having enough energy to fill the energy gap left by the mixing matrix Mopt (the mixing matrix without energy compensation). For the purposes of the present invention, however, decorrelated signals are not relied upon, in order to avoid any artifacts introduced by the decorrelator; hence, the energy compensation as shown in step 758 is preferred.

因此，用于协方差合成的优化算法在步骤751、752、753、754中以及在步骤756内针对酉矩阵P的计算提供了优点。需要强调的是，优化算法甚至提供了优于现有技术的优点，其中仅步骤751、752、753、754、756之一或仅这些步骤的子组被实现，如图所示，但对应的其他步骤与现有技术一样被实现。原因是这些改进并不彼此依赖，而是可以彼此独立应用。然而，就实现的复杂度而言，实现的改进越多，该过程将越好。因此，图11实施例的完整实现是优选的，因为它提供了最大程度的复杂度降低，但即使根据优化算法仅实现步骤751、752、753、754、756之一并且其他步骤如现有技术一样实现，也获得了复杂度降低而没有任何质量劣化。Therefore, the optimized algorithm for the covariance synthesis provides advantages in steps 751, 752, 753, 754 and, for the calculation of the unitary matrix P, in step 756. It is to be emphasized that the optimized algorithm provides advantages over the prior art even when only one of steps 751, 752, 753, 754, 756, or only a subgroup of these steps, is implemented as illustrated, while the corresponding other steps are implemented as in the prior art. The reason is that these improvements do not depend on each other but can be applied independently of each other. Nevertheless, with respect to implementation complexity, the more of these improvements are implemented, the better the procedure will be. Therefore, the full implementation of the Fig. 11 embodiment is preferred, since it provides the greatest complexity reduction; but even when only one of steps 751, 752, 753, 754, 756 is implemented according to the optimized algorithm and the other steps are implemented as in the prior art, a complexity reduction is obtained without any quality degradation.
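
As a rough numerical sketch only: the flow of steps 751 to 758 can be illustrated on the basis of the published covariance-synthesis framework (M = Ky P Kx^-1 with a unitary P obtained from an SVD involving the prototype matrix Q). The regularization constant, the omission of the second (beta) regularization, and the exact matrix conventions are assumptions of this sketch and not the patent's specification:

```python
import numpy as np

def optimized_mixing_matrix(R, direct_powers, cx_diag, Q, alpha=0.001):
    """Sketch of steps 751-758. Because Cy = R E R^H, a factor of Cy can be
    taken directly as Ky = R sqrt(E) (step 751, no SVD needed), and because
    Cx is treated as diagonal, Kx = sqrt(diag(Cx)) (step 752)."""
    Ky = R @ np.diag(np.sqrt(direct_powers))          # step 751
    kx = np.sqrt(np.asarray(cx_diag, dtype=float))    # step 752
    kx_reg = np.maximum(kx, alpha * np.max(kx))       # step 753 (regularized)
    # step 756: unitary P from the SVD of Kx^H Q^H Ky (Q: prototype matrix)
    U, _, Vh = np.linalg.svd(np.diag(kx) @ Q.conj().T @ Ky,
                             full_matrices=False)
    P = Vh.conj().T @ U.conj().T
    M = Ky @ P @ np.diag(1.0 / kx_reg)                # step 757: Mopt
    # step 758: diagonal energy compensation G instead of a decorrelator
    cy_diag = np.real(np.diag(Ky @ Ky.conj().T))
    achieved = np.real(np.diag(M @ np.diag(kx ** 2) @ M.conj().T))
    G = np.diag(np.sqrt(cy_diag / np.maximum(achieved, 1e-12)))
    return G @ M
```

Under these assumptions the compensation step forces the diagonal of M Cx M^H to match the target channel powers, so no decorrelator-derived residual signal is needed.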

本发明的实施例也可以被视为如下过程：通过混合三个高斯(Gaussian)噪声源(每个通道一个高斯噪声源)和用于创建相关背景噪声的第三公共噪声源来为立体声信号生成舒适噪声，或者附加地或单独地，利用与SID帧一起发送的相干值控制对噪声源的混合。Embodiments of the invention may also be regarded as a process of generating comfort noise for a stereo signal by mixing three Gaussian noise sources, namely one noise source per channel plus a third, common noise source for creating a correlated background noise, or, additionally or separately, as a process of controlling the mixing of the noise sources by means of a coherence value transmitted together with the SID frames.
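
This mixing can be sketched as follows; the sqrt-based law that maps the coherence value to the gains of the common and the per-channel noise sources is an assumption of this sketch:

```python
import math
import random

def stereo_comfort_noise(num_samples, coherence, seed=0):
    """Generate a stereo comfort-noise tile from two per-channel Gaussian
    noise sources plus a common third source; the coherence value
    (0..1, e.g. as transmitted in a SID frame) controls the share of the
    common source and hence the inter-channel correlation."""
    rng = random.Random(seed)
    g_common = math.sqrt(coherence)
    g_own = math.sqrt(1.0 - coherence)
    left, right = [], []
    for _ in range(num_samples):
        n_common = rng.gauss(0.0, 1.0)
        left.append(g_own * rng.gauss(0.0, 1.0) + g_common * n_common)
        right.append(g_own * rng.gauss(0.0, 1.0) + g_common * n_common)
    return left, right
```

With coherence 1 both channels reproduce the common source and are fully correlated; with coherence 0 they are independent.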

这里要提到的是，之前和之后讨论的所有备选方案或方面以及由所附权利要求中的权利要求限定的所有方面可以单独地使用，即，没有与所设想的备选方案、目标或独立权利要求不同的任何其他备选方案或目标。然而，在其他实施例中，两个或更多个备选方案或方面或独立权利要求可以彼此组合，并且在其他实施例中，所有方面或备选方案和所有独立权利要求可以彼此组合。It is to be mentioned here that all alternatives or aspects discussed before and after, and all aspects defined by the claims among the appended claims, can be used individually, i.e., without any alternative, object or independent claim other than the contemplated alternative, object or independent claim. However, in other embodiments, two or more alternatives or aspects or independent claims can be combined with each other, and in other embodiments, all aspects or alternatives and all independent claims can be combined with each other.

本发明的编码信号可以存储在数字存储介质或非暂时性存储介质上,或者可以在诸如无线传输介质或诸如互联网的有线传输介质的传输介质上传输。The encoded signal of the present invention may be stored on a digital storage medium or a non-transitory storage medium, or may be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.

尽管已经在装置的上下文中描述了一些方面,但将清楚的是,这些方面还表示对应方法的描述,其中,块或装置对应于方法步骤或方法步骤的特征。类似地,在方法步骤上下文中描述的方面也指示对相应块或项或者相应装置的特征的描述。Although some aspects have been described in the context of an apparatus, it will be clear that these aspects also represent a description of a corresponding method, wherein a block or apparatus corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method step also indicate a description of a feature of a corresponding block or item or a corresponding apparatus.

取决于某些实现要求,可以在硬件中或在软件中实现本发明的实施例。实现可以使用其上存储有电子可读控制信号的数字存储介质(例如,软盘、DVD、CD、ROM、PROM、EPROM、EEPROM或闪存)来执行,与可编程计算机系统协作(或能够协作),使得执行相应方法。Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium (e.g., a floppy disk, DVD, CD, ROM, PROM, EPROM, EEPROM or flash memory) on which electronically readable control signals are stored, cooperating (or capable of cooperating) with a programmable computer system such that the corresponding method is performed.

根据本发明的一些实施例包括具有电子可读控制信号的数据载体,能够与可编程计算机系统协作,使得执行本文所述的方法之一。Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.

通常,本发明的实施例可以实现为具有程序代码的计算机程序产品,程序代码可操作以在计算机程序产品在计算机上运行时执行方法之一。该程序代码可以例如存储在机器可读载体上。Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative to perform one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine readable carrier.

其他实施例包括存储在机器可读载体或非暂时性存储介质上的用于执行本文描述的方法之一的计算机程序。Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.

换言之,本发明的方法的实施例因此是具有程序代码的计算机程序,该程序代码用于在计算机程序在计算机上运行时执行本文所述的方法之一。In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.

因此,本发明的方法的其他实施例是其上记录有计算机程序的数据载体或数字存储介质或计算机可读介质,该计算机程序用于执行本文所述的方法之一。A further embodiment of the inventive methods is, therefore, a data carrier or a digital storage medium or a computer-readable medium having recorded thereon the computer program for performing one of the methods described herein.

因此,本发明的方法的其他实施例是表示计算机程序的数据流或信号序列,所述计算机程序用于执行本文描述的方法之一。数据流或信号序列可以例如被配置为经由数据通信连接(例如,经由互联网)传送。Therefore, other embodiments of the method of the present invention are a data stream or a sequence of signals representing a computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transmitted via a data communication connection (e.g., via the Internet).

另一实施例包括处理装置,例如,计算机或可编程逻辑器件,所述处理装置被配置为或适于执行本文所述的方法之一。A further embodiment comprises a processing means, for example a computer or a programmable logic device, configured to or adapted to perform one of the methods described herein.

另一实施例包括其上安装有计算机程序的计算机,该计算机程序用于执行本文所述的方法之一。A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

在一些实施例中,可编程逻辑器件(例如,现场可编程门阵列)可以用于执行本文所述的方法的功能中的一些或全部。在一些实施例中,现场可编程门阵列可以与微处理器协作以执行本文所述的方法之一。通常,方法优选地由任意硬件装置来执行。In some embodiments, a programmable logic device (e.g., a field programmable gate array) can be used to perform some or all of the functions of the methods described herein. In some embodiments, a field programmable gate array can collaborate with a microprocessor to perform one of the methods described herein. Typically, the method is preferably performed by any hardware device.

上述实施例对于本发明的原理仅是说明性的。应当理解,本文描述的布置和细节的修改和变形对于本领域其他技术人员将是显而易见的。因此,旨在仅由所附专利权利要求的范围来限制而不是由借助对本文的实施例的描述和解释所给出的具体细节来限制。The above embodiments are merely illustrative of the principles of the present invention. It should be understood that modifications and variations of the arrangements and details described herein will be apparent to other persons skilled in the art. Therefore, it is intended that the scope of the present invention be limited only by the scope of the appended patent claims and not by the specific details given by way of the description and explanation of the embodiments herein.

方面(彼此独立使用、或与所有其他方面或仅其他方面的子组一起使用)Aspects (used independently of each other, or with all other aspects, or only a subgroup of other aspects)

装置、方法或计算机程序包括下面所提到的特征中的一个或多个:The apparatus, method or computer program may comprise one or more of the following features:

关于新颖方面的发明示例:Examples of inventions with regard to novel aspects:

·多波思想与对象编码相结合(每个T/F区使用多于一个方向提示)Combination of the multi-wave idea with object coding (more than one directional cue is used per T/F tile)

·对象编码方法,其尽可能地接近DirAC范式,以允许IVAS中的任何种类的输入类型(目前未涵盖的对象内容)An object encoding method that is as close as possible to the DirAC paradigm to allow any kind of input type in IVAS (object content not currently covered)

关于参数化(编码器)的发明示例:Inventive example about parameterization (encoder):

·对于每个T/F区:该T/F区中的n个最相关对象的选择信息加上这n个最相关对象贡献之间的功率比For each T/F zone: the selection information of the n most relevant objects in the T/F zone plus the power ratio between the contributions of these n most relevant objects

·对于每个帧,对于每个对象:一个方向For each frame, for each object: one direction
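
The encoder-side parameterization listed above can be sketched per T/F tile as follows (n = 2 and the power-based relevance criterion are assumptions consistent with the description):

```python
def tile_parameters(object_powers, n=2):
    """For one time/frequency tile: select the indices of the n most
    relevant (highest-power) objects and compute the power ratios among
    just these selected objects (they sum to 1 by construction)."""
    order = sorted(range(len(object_powers)),
                   key=lambda i: object_powers[i], reverse=True)
    selected = order[:n]
    total = sum(object_powers[i] for i in selected)
    ratios = [object_powers[i] / total for i in selected]
    return selected, ratios
```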

关于渲染(解码器)的发明示例:Inventive example about rendering (decoder):

·从所发送的对象索引和方向信息以及目标输出布局中获取每个相关对象的直接响应值Get the direct response value for each relevant object from the sent object index and orientation information and the target output layout

·从直接响应中获取协方差矩阵Get the covariance matrix from the direct response

·根据每个相关对象的下混信号功率和发送功率比计算直接功率Calculate the direct power based on the downmix signal power and transmit power ratio of each relevant object

·从直接功率和协方差矩阵中获取最终目标协方差矩阵Obtain the final target covariance matrix from the direct power and covariance matrices

·仅使用输入协方差矩阵的对角线元素Only the diagonal elements of the input covariance matrix are used

·优化的协方差合成Optimized covariance synthesis

关于与SAOC差异的一些旁注:Some side notes on the differences with SAOC:

·考虑n个主要对象而不是所有对象Consider n main objects instead of all objects

→功率比因此与OLD相关,但计算方式不同→ Power ratio is therefore related to OLD, but is calculated differently

·SAOC在编码器处不使用方向->方向信息仅在解码器(渲染矩阵)处被引入SAOC does not use direction at the encoder -> direction information is only introduced at the decoder (rendering matrix)

→SAOC-3D解码器接收用于渲染矩阵的对象元数据→SAOC-3D decoder receives object metadata for rendering matrices

·SAOC采用下混矩阵并发送下混增益SAOC uses the downmix matrix and sends the downmix gain

·本发明实施例不考虑扩散度Embodiments of the invention do not take diffuseness into account

随后,总结了本发明的其他示例。Subsequently, other examples of the present invention are summarized.

1.一种用于对多个音频对象和指示关于所述多个音频对象的方向信息的相关元数据进行编码的装置,包括:1. An apparatus for encoding a plurality of audio objects and associated metadata indicating directional information about the plurality of audio objects, comprising:

下混器(400),用于对所述多个音频对象进行下混以获得一个或多个传输通道;A downmixer (400) for downmixing the plurality of audio objects to obtain one or more transmission channels;

传输通道编码器(300),用于对一个或多个传输通道进行编码以获得一个或多个编码传输通道;以及A transmission channel encoder (300) for encoding one or more transmission channels to obtain one or more encoded transmission channels; and

输出接口(200),用于输出包括所述一个或多个编码传输通道的编码音频信号,an output interface (200) for outputting a coded audio signal comprising said one or more coded transmission channels,

其中,所述下混器(400)被配置为响应于关于所述多个音频对象的方向信息而对所述多个音频对象进行下混。The downmixer (400) is configured to downmix the plurality of audio objects in response to directional information about the plurality of audio objects.

2.根据示例1所述的装置,其中,所述下混器(400)被配置为:2. The apparatus of example 1, wherein the downmixer (400) is configured to:

生成两个传输通道作为两个虚拟麦克风信号,所述两个虚拟麦克风信号布置在相同位置处并具有不同取向、或布置在相对于诸如虚拟听者位置或取向的参考位置或取向的两个不同位置处,或generating two transmission channels as two virtual microphone signals arranged at the same position and with different orientations, or arranged at two different positions relative to a reference position or orientation, such as a virtual listener position or orientation, or

生成三个传输通道作为三个虚拟麦克风信号,所述三个虚拟麦克风信号布置在相同位置处并具有不同取向、或布置在相对于诸如虚拟听者位置或取向的参考位置或取向的三个不同位置处,或generating three transmission channels as three virtual microphone signals arranged at the same position and with different orientations, or arranged at three different positions relative to a reference position or orientation, such as a virtual listener position or orientation, or

生成四个传输通道作为四个虚拟麦克风信号,所述四个虚拟麦克风信号布置在相同位置处并具有不同取向、或布置在相对于诸如虚拟听者位置或取向之类的参考位置或取向的四个不同位置处,或generating four transmission channels as four virtual microphone signals arranged at the same position and with different orientations, or arranged at four different positions relative to a reference position or orientation such as a virtual listener position or orientation, or

其中,所述虚拟麦克风信号是虚拟一阶麦克风信号、或虚拟心形麦克风信号、或虚拟8字形或偶极或双向麦克风信号、或虚拟定向麦克风信号、或虚拟亚心形麦克风信号、或虚拟单向麦克风信号、或虚拟超心形麦克风信号、或虚拟全向麦克风信号。The virtual microphone signal is a virtual first-order microphone signal, or a virtual cardioid microphone signal, or a virtual figure-8 or dipole or bidirectional microphone signal, or a virtual directional microphone signal, or a virtual sub-cardioid microphone signal, or a virtual unidirectional microphone signal, or a virtual supercardioid microphone signal, or a virtual omnidirectional microphone signal.

3.根据示例1或2所述的装置,其中,所述下混器(400)被配置为:3. The apparatus of example 1 or 2, wherein the downmixer (400) is configured to:

针对所述多个音频对象中的每个音频对象,使用对应音频对象的方向信息来导出(402)针对每个传输通道的加权信息;For each audio object of the plurality of audio objects, deriving (402) weighting information for each transmission channel using directional information of the corresponding audio object;

使用针对特定传输通道的音频对象的加权信息对所述对应音频对象进行加权(404),以获得针对所述特定传输通道的对象贡献,以及weighting the corresponding audio object using the weighting information of the audio object for a specific transmission channel (404) to obtain an object contribution for the specific transmission channel, and

组合(406)所述多个音频对象对所述特定传输通道的对象贡献,以获得所述特定传输通道。The object contributions of the plurality of audio objects to the specific transmission channel are combined (406) to obtain the specific transmission channel.

4.根据前述示例之一所述的装置,4. The device according to one of the preceding examples,

其中,所述下混器(400)被配置为:计算所述一个或多个传输通道作为一个或多个虚拟麦克风信号,所述一个或多个虚拟麦克风信号布置在相同位置处并且具有不同取向、或布置在相对于诸如虚拟听者位置或取向之类的参考位置或取向的不同位置处,所述方向信息与所述参考位置或取向相关,The downmixer (400) is configured to calculate the one or more transmission channels as one or more virtual microphone signals, the one or more virtual microphone signals being arranged at the same position and having different orientations, or being arranged at different positions relative to a reference position or orientation such as a virtual listener position or orientation, the directional information being related to the reference position or orientation,

其中,所述不同位置或取向在中心线上或中心线的左侧和所述中心线上或所述中心线的右侧,或者其中,所述不同位置或取向均匀或不均匀地分布到水平位置或取向,例如相对于所述中心线+90度或-90度,或相对于所述中心线-120度、0度和+120度,或者其中,所述不同位置或取向包括相对于虚拟听者所在的水平面向上或向下指向的至少一个位置或取向,其中,关于所述多个音频对象的方向信息与所述虚拟听者位置或参考位置或取向相关。Wherein the different positions or orientations are on or to the left of the center line and on or to the right of the center line, or wherein the different positions or orientations are uniformly or unevenly distributed to horizontal positions or orientations, for example, +90 degrees or -90 degrees relative to the center line, or -120 degrees, 0 degrees and +120 degrees relative to the center line, or wherein the different positions or orientations include at least one position or orientation pointing upward or downward relative to a horizontal plane where a virtual listener is located, and wherein the directional information about the multiple audio objects is related to the virtual listener position or a reference position or orientation.

5.根据前述示例之一所述的装置,还包括:5. The apparatus according to one of the preceding examples, further comprising:

参数处理器(110),用于对指示关于所述多个音频对象的方向信息的元数据进行量化,以获得所述多个音频对象的量化方向项,A parameter processor (110) is used to quantize metadata indicating directional information about the plurality of audio objects to obtain quantized directional items of the plurality of audio objects,

其中,所述下混器(400)被配置为:响应于作为所述方向信息的所述量化方向项进行操作,以及The downmixer (400) is configured to: operate in response to the quantized direction term as the direction information, and

其中,所述输出接口(200)被配置为:将关于所述量化方向项的信息引入到所述编码音频信号中。The output interface (200) is configured to introduce information about the quantization direction item into the encoded audio signal.
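For illustration only, a minimal sketch of quantizing an object's direction metadata to a quantized direction item and back; the uniform 7-bit azimuth grid is an assumed quantizer, the text does not fix a particular one:

```python
import numpy as np

def quantize_azimuth(azimuth_deg, num_bits=7):
    """Uniformly quantize an azimuth in [-180, 180) degrees.

    Returns the quantization index (the quantized direction item to be
    introduced into the encoded audio signal) and the dequantized azimuth
    that the downmixer would operate on.
    """
    levels = 1 << num_bits                 # e.g. 128 grid points for 7 bits
    step = 360.0 / levels
    idx = int(np.round(((azimuth_deg + 180.0) % 360.0) / step)) % levels
    return idx, idx * step - 180.0
```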

6.根据前述示例之一所述的装置,6. The device according to one of the preceding examples,

其中,所述下混器(400)被配置为:对关于所述多个音频对象的方向信息执行分析,并根据所述分析的结果放置一个或多个虚拟麦克风以便生成所述传输通道。The downmixer (400) is configured to: perform analysis on directional information about the plurality of audio objects, and place one or more virtual microphones according to the result of the analysis so as to generate the transmission channel.

7.根据前述示例之一所述的装置,7. The device according to one of the preceding examples,

其中,所述下混器(400)被配置为:使用在多个时间帧上是静态的下混规则进行下混(408),或wherein the downmixer (400) is configured to: perform downmixing (408) using a downmixing rule that is static over a plurality of time frames, or

其中,所述方向信息在多个时间帧上是可变的,并且其中,所述下混器(400)被配置为使用在所述多个时间帧上是可变的下混规则进行下混(405)。wherein the directional information is variable over a plurality of time frames, and wherein the downmixer (400) is configured to perform the downmix (405) using a downmix rule that is variable over the plurality of time frames.

8.根据前述示例之一所述的装置,8. The device according to one of the preceding examples,

其中,所述下混器(400)被配置为:使用所述多个音频对象的样本的逐个样本加权和组合在时域中进行下混。The downmixer (400) is configured to perform downmixing in the time domain using sample-by-sample weighting and combination of samples of the plurality of audio objects.

9.根据前述示例之一所述的装置,还包括:9. The apparatus according to one of the preceding examples, further comprising:

对象参数计算器(100),被配置为:针对与时间帧相关的多个频率区间中的一个或多个频率区间,计算至少两个相关音频对象的参数数据,其中,所述至少两个相关音频对象的数量低于所述多个音频对象的总数,以及An object parameter calculator (100) is configured to: calculate parameter data of at least two related audio objects for one or more frequency intervals of a plurality of frequency intervals associated with a time frame, wherein the number of the at least two related audio objects is less than the total number of the plurality of audio objects, and

其中，所述输出接口(200)被配置为：将关于针对所述一个或多个频率区间的所述至少两个相关音频对象的参数数据的信息引入到所述编码音频信号中。The output interface (200) is configured to introduce information about parameter data of the at least two related audio objects for the one or more frequency intervals into the encoded audio signal.

10.根据示例9所述的装置,其中,所述对象参数计算器(100)被配置为:10. The apparatus according to example 9, wherein the object parameter calculator (100) is configured to:

将所述多个音频对象中的每个音频对象转换(120)为具有所述多个频率区间的频谱表示,converting (120) each audio object of the plurality of audio objects into a spectral representation having the plurality of frequency bins,

针对所述一个或多个频率区间计算(122)每个音频对象的选择信息,以及calculating (122) selection information for each audio object for the one or more frequency intervals, and

基于所述选择信息来导出(124)对象标识作为指示所述至少两个相关音频对象的参数数据,以及deriving (124) an object identification as parameter data indicative of the at least two related audio objects based on the selection information, and

其中,所述输出接口(200)被配置为将关于所述对象标识的信息引入到所述编码音频信号中。Therein, the output interface (200) is configured to introduce information about the object identification into the encoded audio signal.

11.根据示例9或10所述的装置,其中,所述对象参数计算器(100)被配置为:对所述一个或多个频率区间中的所述相关音频对象的一个或多个幅度相关测量值或从幅度相关测量值导出的一个或多个组合值进行量化和编码(212),作为所述参数数据,以及11. An apparatus according to example 9 or 10, wherein the object parameter calculator (100) is configured to: quantize and encode (212) one or more amplitude-related measurement values of the relevant audio object in the one or more frequency intervals or one or more combined values derived from the amplitude-related measurement values as the parameter data, and

其中，所述输出接口(200)被配置为：将经量化的一个或多个幅度相关测量值或经量化的一个或多个组合值引入到所述编码音频信号中。The output interface (200) is configured to introduce the quantized one or more amplitude-related measurement values or the quantized one or more combined values into the encoded audio signal.

12.根据示例10或11所述的装置，12. The apparatus according to example 10 or 11,

其中,所述选择信息是所述音频对象的诸如幅度值、功率值或响度值、或提高到不同于1的功率的幅度之类的幅度相关测量值,以及wherein the selection information is an amplitude-related measurement value of the audio object, such as an amplitude value, a power value or a loudness value, or an amplitude raised to a power different from 1, and

其中,所述对象参数计算器(100)被配置为:计算(127)组合值,例如相关音频对象的幅度相关测量值与相关音频对象的两个或更多个幅度相关测量值之和的比率,以及wherein the object parameter calculator (100) is configured to: calculate (127) a combined value, such as a ratio of an amplitude-related measure of the associated audio object to a sum of two or more amplitude-related measures of the associated audio object, and

其中，所述输出接口(200)被配置为：将关于所述组合值的信息引入到所述编码音频信号中，其中，所述编码音频信号中关于组合值的信息项的数量至少等于1且小于所述一个或多个频率区间的相关音频对象的数量。The output interface (200) is configured to introduce information about the combination value into the encoded audio signal, wherein the number of information items about the combination value in the encoded audio signal is at least equal to 1 and less than the number of related audio objects in the one or more frequency intervals.

13.根据示例10至12之一所述的装置,13. The device according to any one of examples 10 to 12,

其中,所述对象参数计算器(100)被配置为:基于所述一个或多个频率区间中的所述多个音频对象的选择信息的顺序来选择所述对象标识。The object parameter calculator (100) is configured to select the object identifier based on the order of the selection information of the plurality of audio objects in the one or more frequency intervals.

14.根据示例10至13之一所述的装置,其中,所述对象参数计算器(100)被配置为:14. The apparatus according to any one of examples 10 to 13, wherein the object parameter calculator (100) is configured to:

计算(122)信号功率作为所述选择信息,calculating (122) signal power as the selection information,

针对每个频率区间分别导出(124)对应一个或多个频率区间中的具有最大信号功率值的两个或更多个音频对象的对象标识,For each frequency interval, deriving (124) object identifications of two or more audio objects having maximum signal power values in one or more frequency intervals, respectively,

计算(126)具有所述最大信号功率值的两个或更多个音频对象的信号功率之和与具有所导出的对象标识的音频对象中的至少一个音频对象的信号功率之间的功率比,作为所述参数数据,以及calculating (126) as the parameter data a power ratio between a sum of signal powers of two or more audio objects having the maximum signal power value and a signal power of at least one of the audio objects having the derived object identification, and

对所述功率比进行量化和编码(212),以及quantizing and encoding the power ratio (212), and

其中,所述输出接口(200)被配置为:将经量化和编码的功率比引入到所述编码音频信号中。Wherein, the output interface (200) is configured to: introduce the quantized and encoded power ratio into the encoded audio signal.
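The per-bin selection of the most dominant objects and the associated power ratios described in example 14 can be sketched as follows; signal power as the selection information follows the example, while the helper name and the omission of the quantization/encoding step are ours:

```python
import numpy as np

def dominant_objects_and_ratios(spectra, num_relevant=2):
    """Per frequency bin, select the `num_relevant` objects with the highest
    signal power and compute power ratios against the sum of their powers.

    spectra: (num_objects, num_bins) spectral frame of all audio objects
    returns: object identifications (num_bins, num_relevant) and
             power ratios (num_bins, num_relevant) summing to 1 per bin
    """
    power = np.abs(np.asarray(spectra)) ** 2        # selection information
    order = np.argsort(power, axis=0)[::-1]         # descending by power
    ids = order[:num_relevant].T                    # (num_bins, num_relevant)
    sel = np.take_along_axis(power.T, ids, axis=1)  # powers of selected objects
    total = np.maximum(sel.sum(axis=1, keepdims=True), 1e-12)
    return ids, sel / total
```

With three objects of amplitudes 1, 2 and 3 in one bin, the two dominant object identifications are 2 and 1, and the ratios are 9/13 and 4/13.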

15.根据示例10至14之一所述的装置,其中,所述输出接口(200)被配置为:将以下内容引入到所述编码音频信号中:15. The apparatus of any one of Examples 10 to 14, wherein the output interface (200) is configured to introduce the following into the encoded audio signal:

一个或多个编码传输通道,One or more coded transmission channels,

所述时间帧中的所述多个频率区间中的一个或多个频率区间中的每个频率区间的相关音频对象的两个或更多个编码对象标识、以及一个或多个编码组合值或编码幅度相关测量值,作为所述参数数据,以及two or more coded object identifications of the audio objects associated with each of one or more of the plurality of frequency intervals in the time frame, and one or more coded combination values or coded amplitude related measurement values, as the parameter data, and

所述时间帧中的每个音频对象的量化和编码方向数据，所述方向数据对于所述一个或多个频率区间中的所有频率区间是恒定的。quantized and encoded direction data of each audio object in the time frame, the direction data being constant for all frequency bins of the one or more frequency bins.

16.根据示例9至15之一所述的装置,其中,所述对象参数计算器(100)被配置为:计算所述一个或多个频率区间中至少最主要对象和第二最主要对象的参数数据,或16. The apparatus according to any one of Examples 9 to 15, wherein the object parameter calculator (100) is configured to: calculate parameter data of at least the most dominant object and the second most dominant object in the one or more frequency intervals, or

其中,所述多个音频对象中的音频对象的数量是三个或更多个,所述多个音频对象包括第一音频对象、第二音频对象和第三音频对象,以及wherein the number of audio objects in the plurality of audio objects is three or more, the plurality of audio objects include a first audio object, a second audio object, and a third audio object, and

其中,所述对象参数计算器(100)被配置为:针对所述一个或多个频率区间中的第一频率区间,仅计算诸如所述第一音频对象和所述第二音频对象的第一组音频对象作为所述相关音频对象;以及针对所述一个或多个频率区间中的第二频率区间,仅计算诸如所述第二音频对象和所述第三音频对象或所述第一音频对象和所述第三音频对象的第二组音频对象作为所述相关音频对象,其中,所述第一组音频对象至少在一个组成员方面不同于所述第二组音频对象。The object parameter calculator (100) is configured to: for a first frequency interval among the one or more frequency intervals, only calculate a first group of audio objects such as the first audio object and the second audio object as the related audio objects; and for a second frequency interval among the one or more frequency intervals, only calculate a second group of audio objects such as the second audio object and the third audio object or the first audio object and the third audio object as the related audio objects, wherein the first group of audio objects is different from the second group of audio objects in at least one group member.

17.根据示例9至16之一所述的装置,其中,所述对象参数计算器(100)被配置为:17. The apparatus according to any one of examples 9 to 16, wherein the object parameter calculator (100) is configured to:

计算具有第一时间或频率分辨率的原始参数化数据,并将所述原始参数化数据组合为具有比所述第一时间或频率分辨率低的第二时间或频率分辨率的组合参数化数据,并且相对于具有所述第二时间或频率分辨率的组合参数化数据计算至少两个相关音频对象的参数数据,或calculating original parametric data having a first time or frequency resolution and combining the original parametric data into combined parametric data having a second time or frequency resolution lower than the first time or frequency resolution, and calculating parametric data for at least two related audio objects relative to the combined parametric data having the second time or frequency resolution, or

确定具有与所述多个音频对象的时间或频率分解中使用的第一时间或频率分辨率不同的第二时间或频率分辨率的参数带,并且针对具有所述第二时间或频率分辨率的参数带计算至少两个相关音频对象的参数数据。A parameter band having a second time or frequency resolution different from the first time or frequency resolution used in the time or frequency decomposition of the plurality of audio objects is determined, and parameter data for at least two related audio objects is calculated for the parameter band having the second time or frequency resolution.
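A minimal sketch of the second alternative of example 17, grouping bin-resolution data into coarser parameter bands; the particular band edges and the use of summed powers as the raw parametric data are assumptions for illustration:

```python
import numpy as np

def to_parameter_bands(bin_powers, band_edges):
    """Combine bin-resolution power data into coarser parameter bands by
    summing all bins that fall into each band.

    bin_powers: (num_objects, num_bins) raw data at the first resolution
    band_edges: ascending bin indices, e.g. [0, 2, 4, 8] gives 3 bands
    returns:    (num_objects, num_bands) data at the second resolution
    """
    bin_powers = np.asarray(bin_powers, dtype=float)
    bands = [bin_powers[:, lo:hi].sum(axis=1)
             for lo, hi in zip(band_edges[:-1], band_edges[1:])]
    return np.stack(bands, axis=1)
```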

18.一种用于对编码音频信号进行解码的解码器,所述编码音频信号包括:多个音频对象的一个或多个传输通道和方向信息;以及针对时间帧的一个或多个频率区间而言的音频对象的参数数据,所述解码器包括:18. A decoder for decoding a coded audio signal, the coded audio signal comprising: one or more transmission channels and direction information of a plurality of audio objects; and parameter data of the audio objects for one or more frequency intervals of a time frame, the decoder comprising:

输入接口(600),用于提供频谱表示形式的所述一个或多个传输通道,所述频谱表示在所述时间帧中具有多个频率区间;以及an input interface (600) for providing the one or more transmission channels in the form of a spectral representation, the spectral representation having a plurality of frequency bins in the time frame; and

音频渲染器(700),用于使用所述方向信息将所述一个或多个传输通道渲染为多个音频通道,an audio renderer (700), configured to render the one or more transmission channels into a plurality of audio channels using the direction information,

其中,所述音频渲染器(700)被配置为:根据所述多个频率区间中的每个频率区间的所述一个或多个音频对象以及与所述频率区间中的一个或多个相关音频对象相关联的方向信息(810),计算直接响应信息(704)。The audio renderer (700) is configured to calculate direct response information (704) based on the one or more audio objects in each frequency interval of the multiple frequency intervals and direction information (810) associated with one or more related audio objects in the frequency interval.

19.根据示例18所述的解码器,19. The decoder according to example 18,

其中,所述音频渲染器(700)被配置为:使用所述直接响应信息和关于所述多个音频通道的信息(702)来计算(706)协方差合成信息,并且将所述协方差合成信息应用(727)于所述一个或多个传输通道以获得所述多个音频通道,或wherein the audio renderer (700) is configured to calculate (706) covariance synthesis information using the direct response information and the information about the plurality of audio channels (702), and to apply (727) the covariance synthesis information to the one or more transmission channels to obtain the plurality of audio channels, or

其中,所述直接响应信息(704)是一个或多个音频对象中的每个音频对象的直接响应向量,并且其中,所述协方差合成信息是协方差合成矩阵,并且其中,所述音频渲染器(700)被配置为:在应用(727)所述协方差合成信息时针对每个频率区间执行矩阵运算。wherein the direct response information (704) is a direct response vector of each audio object in one or more audio objects, and wherein the covariance synthesis information is a covariance synthesis matrix, and wherein the audio renderer (700) is configured to perform matrix operations for each frequency interval when applying (727) the covariance synthesis information.

20.根据示例18或19所述的解码器,其中,所述音频渲染器(700)被配置为:20. The decoder of example 18 or 19, wherein the audio renderer (700) is configured to:

在计算所述直接响应信息(704)时,导出所述一个或多个音频对象的直接响应向量,并且针对所述一个或多个音频对象,根据每个直接响应向量来计算协方差矩阵,When calculating the direct response information (704), direct response vectors of the one or more audio objects are derived, and for the one or more audio objects, a covariance matrix is calculated according to each direct response vector.

在计算所述协方差合成信息时,从以下内容导出(724)目标协方差信息:一个音频对象的协方差矩阵或多个音频对象的协方差矩阵,关于相应一个或多个音频对象的功率信息,以及从所述一个或多个传输通道导出的功率信息。When calculating the covariance synthesis information, target covariance information is derived (724) from: a covariance matrix of an audio object or covariance matrices of multiple audio objects, power information about the corresponding one or more audio objects, and power information derived from the one or more transmission channels.

21.根据示例20所述的解码器,其中,所述音频渲染器(700)被配置为:21. The decoder of example 20, wherein the audio renderer (700) is configured to:

在计算所述直接响应信息时,导出所述一个或多个音频对象的直接响应向量,并且针对每一个或多个音频对象,根据每个直接响应向量来计算(723)协方差矩阵,When calculating the direct response information, direct response vectors of the one or more audio objects are derived, and for each of the one or more audio objects, a covariance matrix is calculated (723) based on each direct response vector,

从所述传输通道导出(726)输入协方差信息,以及deriving (726) input covariance information from the transmission channel, and

从所述目标协方差信息、所述输入协方差信息和关于多个通道的信息导出(725a,725b)混合信息,以及deriving (725a, 725b) mixing information from the target covariance information, the input covariance information and information about a plurality of channels, and

将所述混合信息应用(727)于所述时间帧中的每个频率区间的传输通道。The mixing information is applied (727) to the transmission channels of each frequency interval in the time frame.

22.根据示例21所述的解码器,其中,针对所述时间帧中的每个频率区间应用混合信息的结果被转换(708)到时域中以获得时域中的多个音频通道。22. The decoder of example 21, wherein a result of applying the mixing information for each frequency bin in the time frame is converted (708) into the time domain to obtain a plurality of audio channels in the time domain.

23.根据示例18至22之一所述的解码器,其中,所述音频渲染器(700)被配置为:23. The decoder of any one of Examples 18 to 22, wherein the audio renderer (700) is configured to:

在分解(752)从所述传输通道导出的输入协方差矩阵时,仅使用所述输入协方差矩阵的主对角线元素,或in decomposing (752) an input covariance matrix derived from the transmission channel, using only the main diagonal elements of the input covariance matrix, or

使用直接响应矩阵以及所述对象或传输通道的功率矩阵来执行对目标协方差矩阵的分解(751),或performing a decomposition of a target covariance matrix using a direct response matrix and a power matrix of the subject or transmission channel (751), or

通过取所述输入协方差矩阵的每个主对角线元素的根来执行(752)对所述输入协方差矩阵的分解,或performing (752) a decomposition of the input covariance matrix by taking the root of each main diagonal element of the input covariance matrix, or

计算(753)经分解的输入协方差矩阵的正则化逆矩阵,或Compute (753) the regularized inverse of the factorized input covariance matrix, or

在计算要用于能量补偿的最佳矩阵时执行(756)奇异值分解,而无需扩展单位矩阵。A singular value decomposition is performed (756) when computing the optimal matrix to be used for energy compensation without expanding the identity matrix.
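The simplified covariance synthesis steps listed above (diagonal-only decomposition of the input covariance, regularized inverse, SVD-based optimal mixing) can be sketched as follows; this is a hedged illustration in the spirit of [Vilkamo2013], not the exact decoder implementation, and the pass-through prototype matrix is an assumption:

```python
import numpy as np

def covariance_synthesis_mixing(C_target, C_input, prototype=None, eps=1e-9):
    """Derive a mixing matrix M with M @ C_input @ M.T ~= C_target.

    The input covariance is decomposed by taking the square root of its
    main-diagonal elements only, the inverse of that decomposition is
    regularized, and the rotation is obtained by a singular value
    decomposition.
    """
    C_target = np.asarray(C_target, dtype=float)
    C_input = np.asarray(C_input, dtype=float)
    n_out, n_in = C_target.shape[0], C_input.shape[0]
    if prototype is None:                       # assumed pass-through prototype
        prototype = np.eye(n_out, n_in)
    kx = np.sqrt(np.diag(C_input))              # diagonal-only decomposition
    kx_inv = kx / (kx * kx + eps)               # regularized inverse
    w, V = np.linalg.eigh(C_target)             # C_target = Ky @ Ky.T
    Ky = V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
    U, _, Vt = np.linalg.svd(np.diag(kx) @ prototype.T @ Ky,
                             full_matrices=False)
    P = (U @ Vt).T                              # orthogonal rotation
    return Ky @ P @ np.diag(kx_inv)
```

When the input covariance is diagonal, the derived M reproduces the target covariance exactly up to the regularization.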

24.根据示例18至23之一所述的解码器,其中,所述一个或多个音频对象的参数数据包括至少两个相关音频对象的参数数据,其中,所述至少两个相关音频对象的数量低于所述多个音频对象的总数,以及24. A decoder according to any one of Examples 18 to 23, wherein the parameter data of the one or more audio objects includes parameter data of at least two related audio objects, wherein the number of the at least two related audio objects is less than the total number of the plurality of audio objects, and

其中,所述音频渲染器(700)被配置为:针对所述一个或多个频率区间中的每一频率区间,根据与所述至少两个相关音频对象中的第一相关音频对象相关联的第一方向信息以及根据与所述至少两个相关音频对象中的第二相关音频对象相关联的第二方向信息,计算来自所述一个或多个传输通道的贡献。The audio renderer (700) is configured to calculate, for each frequency interval in the one or more frequency intervals, a contribution from the one or more transmission channels based on first directional information associated with a first related audio object in the at least two related audio objects and based on second directional information associated with a second related audio object in the at least two related audio objects.
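For illustration, a sketch of rendering one frequency bin of a mono transport channel using only the two relevant objects of that bin, each weighted by its power ratio and panned by its direction information; the sine/cosine stereo panning used as the direct response here is a hypothetical stand-in (the decoder examples use covariance synthesis instead):

```python
import numpy as np

def render_bin(transport_bin, ids, ratios, directions_deg, num_channels=2):
    """Render one frequency bin to output channels from its relevant objects.

    transport_bin:  complex spectral value of the (mono) transport channel
    ids, ratios:    relevant object identifications and their power ratios
    directions_deg: per-object azimuths from the direction information
    """
    out = np.zeros(num_channels, dtype=complex)
    for obj_id, ratio in zip(ids, ratios):
        az = np.radians(directions_deg[obj_id])
        # Hypothetical stereo sine/cosine panning as the direct response
        theta = (np.clip(az, -np.pi / 2, np.pi / 2) + np.pi / 2) / 2
        direct = np.array([np.cos(theta), np.sin(theta)])
        out += np.sqrt(ratio) * direct * transport_bin
    return out
```

Directions of objects that are not among the relevant objects of the bin are simply never consulted, matching example 25.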

25.根据示例24所述的解码器,25. The decoder according to example 24,

其中,所述音频渲染器(700)被配置为:针对所述一个或多个频率区间,忽略与所述至少两个相关音频对象不同的音频对象的方向信息。The audio renderer (700) is configured to: for the one or more frequency intervals, ignore direction information of audio objects different from the at least two related audio objects.

26.根据示例24或25所述的解码器,26. The decoder according to example 24 or 25,

其中,所述编码音频信号包括所述参数数据中的每个相关音频对象的幅度相关测量值或与至少两个相关音频对象相关的组合值,以及wherein the encoded audio signal comprises an amplitude-related measurement value of each relevant audio object in the parameter data or a combined value associated with at least two relevant audio objects, and

其中,所述音频渲染器(700)被配置为:根据与所述至少两个相关音频对象中的第一相关音频对象相关联的第一方向信息以及根据与所述至少两个相关音频对象中的第二相关音频对象相关联的第二方向信息,将来自所述一个或多个传输通道的贡献考虑在内进行操作,或者根据所述幅度相关测量值或所述组合值来确定所述一个或多个传输通道的定量贡献。The audio renderer (700) is configured to: operate based on first directional information associated with a first related audio object among the at least two related audio objects and based on second directional information associated with a second related audio object among the at least two related audio objects, taking into account the contribution from the one or more transmission channels, or determine the quantitative contribution of the one or more transmission channels based on the amplitude-related measurement value or the combined value.

27.根据示例26所述的解码器,其中,编码信号包括所述参数数据中的组合值,以及27. The decoder of example 26, wherein the encoded signal comprises a combined value in the parameter data, and

其中,所述音频渲染器(700)被配置为:使用相关音频对象之一的组合值和该一个相关音频对象的方向信息来确定所述一个或多个传输通道的贡献,以及The audio renderer (700) is configured to determine the contribution of the one or more transmission channels using a combined value of one of the related audio objects and directional information of the one related audio object, and

其中,所述音频渲染器(700)被配置为:使用从所述一个或多个频率区间中的相关音频对象中的另一相关音频对象的组合值以及所述另一相关音频对象的方向信息导出的值,确定所述一个或多个传输通道的贡献。The audio renderer (700) is configured to determine the contribution of the one or more transmission channels using a value derived from a combined value of another related audio object in the one or more frequency intervals and directional information of the other related audio object.

28.根据示例24至27之一所述的解码器,其中,所述音频渲染器(700)被配置为:28. The decoder of any one of Examples 24 to 27, wherein the audio renderer (700) is configured to:

根据所述多个频率区间中的每个频率区间的相关音频对象以及与所述频率区间中的相关音频对象相关联的方向信息，计算所述直接响应信息(704)。calculate the direct response information (704) based on the relevant audio objects of each frequency bin of the plurality of frequency bins and on the direction information associated with the relevant audio objects in the frequency bin.

29.根据示例28所述的解码器,29. The decoder according to example 28,

其中,所述音频渲染器(700)被配置为:使用元数据中包括的诸如扩散度参数之类的扩散度信息或解相关规则来确定(741)所述多个频率区间中的每个频率区间的扩散信号,并且组合由所述直接响应信息确定的直接响应和所述扩散信号,以获得所述多个通道中的通道的谱域渲染信号。The audio renderer (700) is configured to determine (741) a diffuse signal for each of the multiple frequency intervals using diffuseness information such as a diffuseness parameter included in the metadata or a decorrelation rule, and to combine a direct response determined by the direct response information and the diffuse signal to obtain a spectral domain rendering signal for a channel of the multiple channels.
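The direct/diffuse combination of example 29 can be sketched as an energy-preserving split driven by a diffuseness parameter; the square-root weighting law is an assumption for this sketch:

```python
import numpy as np

def combine_direct_diffuse(direct_bin, diffuse_bin, diffuseness):
    """Combine a direct rendered bin with a decorrelated diffuse bin.

    diffuseness: psi in [0, 1]; 0 keeps only the direct response,
    1 keeps only the diffuse signal, energies summing to the bin energy.
    """
    psi = np.clip(diffuseness, 0.0, 1.0)
    return np.sqrt(1.0 - psi) * direct_bin + np.sqrt(psi) * diffuse_bin
```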

30.一种对多个音频对象和指示关于所述多个音频对象的方向信息的相关元数据进行编码的方法,包括:30. A method of encoding a plurality of audio objects and associated metadata indicating directional information about the plurality of audio objects, comprising:

对所述多个音频对象进行下混以获得一个或多个传输通道;downmixing the plurality of audio objects to obtain one or more transmission channels;

对所述一个或多个传输通道进行编码以获得一个或多个编码传输通道;以及encoding the one or more transport channels to obtain one or more coded transport channels; and

输出包括所述一个或多个编码传输通道的编码音频信号,outputting an encoded audio signal comprising said one or more encoded transmission channels,

其中,所述下混包括响应于关于所述多个音频对象的方向信息而对所述多个音频对象进行下混。The downmixing includes downmixing the multiple audio objects in response to directional information about the multiple audio objects.

31.一种对编码音频信号进行解码的方法,所述编码音频信号包括:多个音频对象的一个或多个传输通道和方向信息;以及针对时间帧的一个或多个频率区间的音频对象的参数数据,所述方法包括:31. A method for decoding an encoded audio signal, the encoded audio signal comprising: one or more transmission channels and direction information of a plurality of audio objects; and parameter data of the audio objects for one or more frequency intervals of a time frame, the method comprising:

提供频谱表示形式的所述一个或多个传输通道,所述频谱表示在所述时间帧中具有多个频率区间;以及providing the one or more transmission channels in the form of a frequency spectrum representation having a plurality of frequency bins in the time frame; and

使用所述方向信息将所述一个或多个传输通道音频渲染为多个音频通道，audio rendering the one or more transmission channels into a plurality of audio channels using the direction information,

其中,所述音频渲染包括:根据所述多个频率区间中的每个频率区间的一个或多个音频对象以及与所述频率区间中的一个或多个相关音频对象相关联的方向信息,计算直接响应信息。The audio rendering includes: calculating direct response information according to one or more audio objects in each frequency interval of the multiple frequency intervals and direction information associated with one or more related audio objects in the frequency interval.

32.一种计算机程序,当在计算机或处理器上运行时,用于执行根据示例30所述的方法或根据示例31所述的方法。32. A computer program for executing the method according to example 30 or the method according to example 31 when run on a computer or a processor.

2参考文献 2 References

[Pulkki2009] V. Pulkki, M.-V. Laitinen, J. Vilkamo, J. Ahonen, T. Lokki, and T. Pihlajamäki, "Directional audio coding - perception-based reproduction of spatial sound", International Workshop on the Principles and Application on Spatial Hearing, Nov. 2009, Zao, Miyagi, Japan.

[SAOC_STD] ISO/IEC, "MPEG audio technologies - Part 2: Spatial Audio Object Coding (SAOC)," ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard 23003-2.

[SAOC_AES] J. Herre, H. Purnhagen, J. Koppens, O. Hellmuth, J. Engdegård, J. Hilpert, L. Villemoes, L. Terentiv, C. Falch, A. Hölzer, M. L. Valero, B. Resch, H. Mundt, and H.-O. Oh, "MPEG Spatial Audio Object Coding - the ISO/MPEG standard for efficient coding of interactive audio scenes," J. AES, vol. 60, no. 9, pp. 655-673, Sep. 2012.

[MPEGH_AES] J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties, "MPEG-H audio - the new standard for universal spatial/3D audio coding," in Proc. 137th AES Conv., Los Angeles, CA, USA, 2014.

[MPEGH_IEEE] J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties, "MPEG-H 3D Audio - The New Standard for Coding of Immersive Spatial Audio", IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, August 2015.

[MPEGH_STD] Text of ISO/MPEG 23008-3/DIS 3D Audio, Sapporo, ISO/IEC JTC1/SC29/WG11 N14747, Jul. 2014.

[SAOC_3D_PAT] APPARATUS AND METHOD FOR ENHANCED SPATIAL AUDIO OBJECT CODING, WO 2015/011024 A1

[Pulkki1997] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," J. Audio Eng. Soc., vol. 45, no. 6, pp. 456-466, Jun. 1997.

[DELAUNAY] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa, "The quickhull algorithm for convex hulls," in Proc. ACM Trans. Math. Software (TOMS), New York, NY, USA, Dec. 1996, vol. 22, pp. 469-483.

[Hirvonen2009] T. Hirvonen, J. Ahonen, and V. Pulkki, "Perceptual compression methods for metadata in Directional Audio Coding applied to audiovisual teleconference", AES 126th Convention 2009, May 7-10, Munich, Germany.

[Borß2014] C. Borß, "A Polygon-Based Panning Method for 3D Loudspeaker Setups", AES 137th Convention 2014, October 9-12, Los Angeles, USA.

[WO2019068638] Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to DirAC based spatial audio coding, 2018

[WO2020249815] PARAMETER ENCODING AND DECODING FOR MULTICHANNEL AUDIO USING DirAC, 2019

[BCC2001] C. Faller, F. Baumgarte: "Efficient representation of spatial audio using perceptual parametrization", Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575).

[JOC_AES] Heiko Purnhagen, Toni Hirvonen, Lars Villemoes, Jonas Samuelsson, Janusz Klejsa: "Immersive Audio Delivery Using Joint Object Coding", 140th AES Convention, Paper Number: 9587, Paris, May 2016.

[AC4_AES] K. Kjörling, J. Rödén, M. Wolters, J. Riedmiller, A. Biswas, P. Ekstrand, A. Gröschel, P. Hedelin, T. Hirvonen, H. Hörich, J. Klejsa, J. Koppens, K. Krauss, H.-M. Lehtonen, K. Linzmeier, H. Muesch, H. Mundt, S. Norcross, J. Popp, H. Purnhagen, J. Samuelsson, M. Schug, L. Sehlström, R. Thesing, L. Villemoes, and M. Vinton: "AC-4 - The Next Generation Audio Codec", 140th AES Convention, Paper Number: 9491, Paris, May 2016.

[Vilkamo2013] J. Vilkamo, T. Bäckström, and A. Kuntz, "Optimized covariance domain framework for time-frequency processing of spatial audio", Journal of the Audio Engineering Society, 2013.

[Golub2013] Gene H. Golub and Charles F. Van Loan, "Matrix Computations", Johns Hopkins University Press, 4th edition, 2013.

Claims (32)

1.一种用于对多个音频对象进行编码的装置,包括:1. An apparatus for encoding a plurality of audio objects, comprising: 对象参数计算器(100),被配置为:针对与时间帧相关的多个频率区间中的一个或多个频率区间,计算至少两个相关音频对象的参数数据,其中,所述至少两个相关音频对象的数量低于所述多个音频对象的总数,以及An object parameter calculator (100), configured to: for one or more frequency intervals in a plurality of frequency intervals associated with a time frame, calculate parameter data of at least two related audio objects, wherein the at least two related the number of audio objects is less than the total number of said plurality of audio objects, and 输出接口(200),被配置为输出编码音频信号,所述编码音频信号包括关于所述一个或多个频率区间的所述至少两个相关音频对象的参数数据的信息。An output interface (200) configured to output an encoded audio signal comprising information about parameter data of said at least two associated audio objects of said one or more frequency intervals. 2.根据权利要求1所述的装置,其中,所述对象参数计算器(100)被配置为:2. The apparatus according to claim 1, wherein the object parameter calculator (100) is configured to: 将所述多个音频对象中的每个音频对象转换(120)为具有多个频率区间的频谱表示,converting (120) each audio object of the plurality of audio objects into a spectral representation having a plurality of frequency bins, 计算(122)所述一个或多个频率区间的每个音频对象的选择信息,以及calculating (122) selection information for each audio object of said one or more frequency intervals, and 基于所述选择信息,导出(124)对象标识作为指示所述至少两个相关音频对象的参数数据,以及Based on said selection information, deriving (124) an object identification as parameter data indicative of said at least two related audio objects, and 其中,所述输出接口(200)被配置为将关于所述对象标识的信息引入到所述编码音频信号中。Wherein said output interface (200) is configured to introduce information about said object identification into said encoded audio signal. 3.根据权利要求1或2所述的装置,其中,所述对象参数计算器(100)被配置为:对所述一个或多个频率区间中的相关音频对象的一个或多个幅度相关测量值或从幅度相关测量值导出的一个或多个组合值进行量化和编码(212),作为所述参数数据,以及3. 
The apparatus according to claim 1 or 2, wherein the object parameter calculator (100) is configured to: one or more amplitude-related measurements of related audio objects in the one or more frequency intervals value or one or more combined values derived from magnitude-related measurements are quantized and encoded (212) as said parameter data, and 其中,所述输出接口(200)被配置为将经量化的一个或多个幅度相关测量值或经量化的一个或多个组合值引入到所述编码音频信号中。Wherein, the output interface (200) is configured to introduce quantized one or more amplitude-related measurement values or quantized one or more combined values into the encoded audio signal. 4.根据权利要求2或3所述的装置,4. The device according to claim 2 or 3, 其中,所述选择信息是所述音频对象的诸如幅度值、功率值或响度值、或提高到不同于1的功率的幅度之类的幅度相关测量值,以及wherein said selection information is an amplitude-related measure of said audio object, such as an amplitude value, a power value or a loudness value, or an amplitude raised to a power different from 1, and 其中,所述对象参数计算器(100)被配置为计算(127)组合值,例如相关音频对象的幅度相关测量值与相关音频对象的两个或更多个幅度相关测量值之和的比率,以及Wherein said object parameter calculator (100) is configured to calculate (127) a combined value, such as a ratio of an amplitude-related measure of a related audio object to a sum of two or more amplitude-related measures of the related audio object, as well as 其中,所述输出接口(200)被配置为:将关于所述组合值的信息引入到所述编码音频信号中,其中,所述编码音频信号中关于所述组合值的信息项的数量至少等于1且小于所述一个或多个频率区间的相关音频对象的数量。Wherein, the output interface (200) is configured to introduce information about the combined value into the encoded audio signal, wherein the number of information items about the combined value in the encoded audio signal is at least equal to 1 and less than the number of related audio objects for the one or more frequency bins. 5.根据权利要求2至4之一所述的装置,5. 
The device according to one of claims 2 to 4, 其中,所述对象参数计算器(100)被配置为基于所述一个或多个频率区间中的所述多个音频对象的选择信息的顺序来选择所述对象标识。Wherein, the object parameter calculator (100) is configured to select the object identifier based on an order of selection information of the plurality of audio objects in the one or more frequency intervals. 6.根据权利要求2至5之一所述的装置,其中,所述对象参数计算器(100)被配置为:6. The apparatus according to one of claims 2 to 5, wherein the object parameter calculator (100) is configured to: 计算(122)信号功率作为所述选择信息,calculating (122) signal power as said selection information, 针对每个频率区间分别导出(124)对应一个或多个频率区间中的具有最大信号功率值的两个或更多个音频对象的对象标识,deriving (124) object identities corresponding to two or more audio objects having maximum signal power values in one or more frequency intervals for each frequency interval, respectively, 计算(126)具有所述最大信号功率值的两个或更多个音频对象的信号功率之和与具有所导出的对象标识的音频对象中的每个音频对象的信号功率之间的功率比作为所述参数数据,以及calculating (126) a power ratio between the sum of the signal powers of the two or more audio objects having said maximum signal power value and the signal power of each of the audio objects having the derived object identity as the parameter data, and 对所述功率比进行量化和编码(212),以及quantizing and encoding (212) the power ratio, and 其中,所述输出接口(200)被配置为将经量化和编码的功率比引入到所述编码音频信号中。Wherein said output interface (200) is configured to introduce the quantized and encoded power ratio into said encoded audio signal. 7.根据权利要求1至6之一所述的装置,其中,所述输出接口(200)被配置为将以下内容引入到所述编码音频信号中:7. 
7. The apparatus according to one of claims 1 to 6, wherein the output interface (200) is configured to introduce into the encoded audio signal:
one or more encoded transport channels,
as the parameter data, two or more encoded object identifications of the related audio objects for each of one or more of the plurality of frequency bins in the time frame, and one or more encoded combined values or encoded amplitude-related measurement values, and
quantized and encoded direction data for each audio object in the time frame, the direction data being constant for all of the one or more frequency bins.

8. The apparatus according to one of claims 1 to 7, wherein the object parameter calculator (100) is configured to calculate parameter data for at least the most dominant object and the second most dominant object in the one or more frequency bins, or
wherein the number of audio objects in the plurality of audio objects is three or more, the plurality of audio objects comprising a first audio object, a second audio object and a third audio object, and
wherein the object parameter calculator (100) is configured to: for a first frequency bin of the one or more frequency bins, calculate only a first group of audio objects, such as the first audio object and the second audio object, as the related audio objects; and, for a second frequency bin of the one or more frequency bins, calculate only a second group of audio objects, such as the second audio object and the third audio object, or the first audio object and the third audio object, as the related audio objects, wherein the first group of audio objects differs from the second group of audio objects in at least one group member.

9. The apparatus according to one of claims 1 to 8, wherein the object parameter calculator (100) is configured to:
calculate raw parametric data having a first time or frequency resolution and combine the raw parametric data into combined parametric data having a second time or frequency resolution lower than the first time or frequency resolution, and calculate the parameter data of the at least two related audio objects with respect to the combined parametric data having the second time or frequency resolution, or
determine parameter bands having a second time or frequency resolution different from a first time or frequency resolution used in the time or frequency decomposition of the plurality of audio objects, and calculate the parameter data of the at least two related audio objects for the parameter bands having the second time or frequency resolution.
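The resolution reduction of claim 9 — combining fine-resolution bins into coarser parameter bands — can be sketched as below. The function name, NumPy, and the example band edges are illustrative assumptions.

```python
import numpy as np

def band_powers(power, band_edges):
    """Combine per-bin signal powers into coarser parameter bands.

    power: (n_objects, n_bins) fine-resolution powers for one time frame.
    band_edges: bin indices delimiting the bands, e.g. [0, 4, 8, 16, n_bins].
    Returns (n_objects, n_bands) powers at the lower frequency resolution.
    """
    return np.stack(
        [power[:, lo:hi].sum(axis=1)                 # sum all bins of one band
         for lo, hi in zip(band_edges[:-1], band_edges[1:])],
        axis=1)
```

The dominant-object selection and power ratios would then be computed once per parameter band instead of once per bin, shrinking the side information.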
10. The apparatus according to one of the preceding claims, wherein the plurality of audio objects comprises associated metadata indicating direction information (810) on the plurality of audio objects, and wherein the apparatus further comprises:
a downmixer (400) for downmixing the plurality of audio objects to obtain one or more transport channels, wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to the direction information on the plurality of audio objects; and
a transport channel encoder (300) for encoding the one or more transport channels to obtain one or more encoded transport channels,
wherein the output interface (200) is configured to introduce the one or more transport channels into the encoded audio signal.

11.
The apparatus according to claim 10, wherein the downmixer (400) is configured to:
generate two transport channels as two virtual microphone signals placed at the same position with different orientations, or at two different positions with respect to a reference position or orientation such as a virtual listener position or orientation, or
generate three transport channels as three virtual microphone signals placed at the same position with different orientations, or at three different positions with respect to a reference position or orientation such as a virtual listener position or orientation, or
generate four transport channels as four virtual microphone signals placed at the same position with different orientations, or at four different positions with respect to a reference position or orientation such as a virtual listener position or orientation,
wherein the virtual microphone signals are virtual first-order microphone signals, or virtual cardioid microphone signals, or virtual figure-of-eight or dipole or bidirectional microphone signals, or virtual directional microphone signals, or virtual subcardioid microphone signals, or virtual unidirectional microphone signals, or virtual hypercardioid microphone signals, or virtual omnidirectional microphone signals.
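The first-order directivity patterns listed in claim 11 (cardioid, dipole, omnidirectional, etc.) share one standard gain formula, sketched below. The function name and parameterization are illustrative assumptions; the claim itself does not prescribe a formula.

```python
import math

def first_order_gain(azimuth_deg, mic_azimuth_deg, shape=0.5):
    """Gain of a first-order virtual microphone toward a source direction.

    shape=0.5 gives a cardioid, 0.0 a figure-of-eight (dipole),
    1.0 an omnidirectional pattern; values in between give sub- or
    hypercardioid responses.
    """
    theta = math.radians(azimuth_deg - mic_azimuth_deg)
    return shape + (1.0 - shape) * math.cos(theta)
```

For example, a cardioid oriented at +90 degrees passes a source at +90 degrees at full gain and rejects one at -90 degrees.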
12. The apparatus according to claim 10 or 11, wherein the downmixer (400) is configured to:
derive (402), for each audio object of the plurality of audio objects, weighting information for each transport channel using the direction information of the corresponding audio object;
weight (404) the corresponding audio object using the weighting information of the audio object for a specific transport channel, to obtain an object contribution for the specific transport channel; and
combine (406) the object contributions of the plurality of audio objects for the specific transport channel, to obtain the specific transport channel.

13. The apparatus according to one of claims 10 to 12, wherein the downmixer (400) is configured to calculate the one or more transport channels as one or more virtual microphone signals placed at the same position with different orientations, or at different positions with respect to a reference position or orientation such as a virtual listener position or orientation, to which the direction information is related,
wherein the different positions or orientations are on a center line, or to the left and to the right of the center line, or wherein the different positions or orientations are distributed evenly or unevenly over horizontal positions or orientations, for example at +90 degrees and -90 degrees with respect to the center line, or at -120 degrees, 0 degrees and +120 degrees with respect to the center line, or wherein the different positions or orientations include at least one position or orientation pointing upwards or downwards with respect to the horizontal plane in which the virtual listener is located, wherein the direction information on the plurality of audio objects is related to the virtual listener position or the reference position or orientation.

14. The apparatus according to one of claims 10 to 13, further comprising:
a parameter processor (110) for quantizing the metadata indicating the direction information on the plurality of audio objects, to obtain quantized direction items for the plurality of audio objects,
wherein the downmixer (400) is configured to operate in response to the quantized direction items as the direction information, and
wherein the output interface (200) is configured to introduce information on the quantized direction items into the encoded audio signal.

15. The apparatus according to one of claims 10 to 14, wherein the downmixer (400) is configured to perform (410) an analysis of the direction information on the plurality of audio objects, and to place (412) one or more virtual microphones for generating the transport channels depending on a result of the analysis.
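Claim 12's downmix — derive a per-channel weight from each object's direction, weight the object, then sum the contributions per channel — can be sketched as below. The function name, NumPy, and the ±90-degree cardioid pair are illustrative assumptions drawn from the examples in claims 11 and 13.

```python
import numpy as np

def downmix(objects, azimuths_deg, mic_azimuths_deg=(90.0, -90.0), shape=0.5):
    """Weight every object signal with a per-channel gain derived from its
    direction metadata, then sum the contributions per transport channel.

    objects: (n_objects, n_samples) time-domain object samples.
    azimuths_deg: per-object direction metadata (azimuth only, for brevity).
    Returns (n_channels, n_samples) transport channels.
    """
    # angle of each object relative to each virtual microphone orientation
    az = np.radians(np.subtract.outer(mic_azimuths_deg, azimuths_deg))
    gains = shape + (1.0 - shape) * np.cos(az)   # (n_channels, n_objects) weights
    return gains @ objects                       # weighted sum per channel
```

With two objects at exactly +90 and -90 degrees and two opposing cardioids, each object lands entirely in its own transport channel.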
16. The apparatus according to one of claims 10 to 15,
wherein the downmixer (400) is configured to downmix (408) using a downmix rule that is static over a plurality of time frames, or
wherein the direction information is variable over a plurality of time frames, and wherein the downmixer (400) is configured to downmix (405) using a downmix rule that is variable over the plurality of time frames.

17. The apparatus according to one of claims 10 to 16, wherein the downmixer (400) is configured to downmix in the time domain using a sample-by-sample weighting and combination of the samples of the plurality of audio objects.

18. A decoder for decoding an encoded audio signal, the encoded audio signal comprising one or more transport channels and direction information of a plurality of audio objects, and parameter data for at least two related audio objects for one or more frequency bins of a time frame, wherein the number of the at least two related audio objects is lower than the total number of the plurality of audio objects, the decoder comprising:
an input interface (600) for providing the one or more transport channels in a spectral representation having a plurality of frequency bins in the time frame; and
an audio renderer (700) for rendering the one or more transport channels into a plurality of audio channels using the direction information, such that a contribution from the one or more transport channels is taken into account in accordance with first direction information associated with a first related audio object of the at least two related audio objects and in accordance with second direction information associated with a second related audio object of the at least two related audio objects, or
wherein the audio renderer (700) is configured to calculate, for each frequency bin of the one or more frequency bins, a contribution from the one or more transport channels in accordance with the first direction information associated with the first related audio object of the at least two related audio objects and in accordance with the second direction information associated with the second related audio object of the at least two related audio objects.

19. The decoder according to claim 18, wherein the audio renderer (700) is configured to ignore, for the one or more frequency bins, direction information of audio objects different from the at least two related audio objects.

20. The decoder according to claim 18 or 19, wherein the encoded audio signal comprises, in the parameter data, an amplitude-related measurement value (812) for each related audio object or a combined value (812) related to the at least two related audio objects, and
wherein the audio renderer (700) is configured to determine (704) a quantitative contribution of the one or more transport channels depending on the amplitude-related measurement value or the combined value.

21.
The decoder according to claim 20, wherein the encoded audio signal comprises the combined value in the parameter data,
wherein the audio renderer (700) is configured to determine (704, 733) the contribution of the one or more transport channels using the combined value of one of the related audio objects and the direction information of this one related audio object, and
wherein the audio renderer (700) is configured to determine (704, 735) the contribution of the one or more transport channels using a value derived from the combined value of another one of the related audio objects in the one or more frequency bins and the direction information of this other related audio object.

22. The decoder according to one of claims 18 to 21, wherein the audio renderer (700) is configured to calculate (704) direct response information depending on the related audio objects for each frequency bin of the plurality of frequency bins and on the direction information associated with the related audio objects in the frequency bins.

23.
The decoder according to claim 22,
wherein the audio renderer (700) is configured to determine (741) a diffuse signal for each frequency bin of the plurality of frequency bins using diffuseness information, such as a diffuseness parameter, included in the metadata, or using a decorrelation rule, and to combine the direct response determined by the direct response information and the diffuse signal to obtain a spectral-domain rendered signal for a channel of the plurality of channels, or
to calculate (706) covariance synthesis information using the direct response information (704) and information (702) on the plurality of audio channels, and to apply (727) the covariance synthesis information to the one or more transport channels to obtain the plurality of audio channels, or
wherein the direct response information (704) is a direct response vector for each related audio object, and wherein the covariance synthesis information is a covariance synthesis matrix, and wherein the audio renderer (700) is configured to perform a matrix operation for each frequency bin when applying (727) the covariance synthesis information.

24.
The decoder according to claim 22 or 23, wherein the audio renderer (700) is configured to:
when calculating the direct response information (704), derive a direct response vector for each related audio object, and calculate, for each related audio object, a covariance matrix from each direct response vector, and
when calculating the covariance synthesis information, derive (724) target covariance information from:
the covariance matrix of each of the related audio objects,
power information on the corresponding related audio object, and
power information derived from the one or more transport channels.

25. The decoder according to claim 24, wherein the audio renderer (700) is configured to:
when calculating the direct response information (704), derive a direct response vector for each related audio object, and calculate (723), for each related audio object, a covariance matrix from each direct response vector,
derive (726) input covariance information from the transport channels,
derive (725a, 725b) mixing information from the target covariance information, the input covariance information and the information on the plurality of channels, and
apply (727) the mixing information to the transport channels for each frequency bin in the time frame.

26. The decoder according to claim 25, wherein a result of applying the mixing information for each frequency bin in the time frame is converted (708) into the time domain to obtain the plurality of audio channels in the time domain.
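The target-covariance derivation of claim 24 — a covariance matrix from each related object's direct response vector, weighted by that object's power — can be sketched for one frequency bin as follows. The function name and NumPy are assumptions; the transport-channel power normalization of the full scheme is omitted for brevity.

```python
import numpy as np

def target_covariance(direct_responses, object_powers):
    """Target covariance of one frequency bin as the power-weighted sum of
    the rank-1 covariances of the related objects' direct-response vectors.

    direct_responses: (n_related, n_out) rendering gains per related object.
    object_powers: (n_related,) signal power of each related object in the bin.
    Returns an (n_out, n_out) target covariance matrix.
    """
    C = np.zeros((direct_responses.shape[1],) * 2)
    for dr, p in zip(direct_responses, object_powers):
        C += p * np.outer(dr, dr)   # rank-1 covariance of one object, scaled by its power
    return C
```

For two objects panned to separate output channels, the target covariance is simply diagonal with the two object powers.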
27. The decoder according to one of claims 22 to 26, wherein the audio renderer (700) is configured to:
when decomposing (752) an input covariance matrix derived from the transport channels, use only the main diagonal elements of the input covariance matrix, or
perform (751) a decomposition of a target covariance matrix using a direct response matrix and a power matrix of the objects or transport channels, or
perform (752) the decomposition of the input covariance matrix by taking the root of each main diagonal element of the input covariance matrix, or
calculate (753) a regularized inverse of the decomposed input covariance matrix, or
perform (756) a singular value decomposition, without an extended identity matrix, when calculating an optimal matrix to be used for an energy compensation.

28. A method of encoding a plurality of audio objects and associated metadata indicating direction information on the plurality of audio objects, comprising:
downmixing the plurality of audio objects to obtain one or more transport channels;
encoding the one or more transport channels to obtain one or more encoded transport channels; and
outputting an encoded audio signal comprising the one or more encoded transport channels,
wherein the downmixing comprises downmixing the plurality of audio objects in response to the direction information on the plurality of audio objects.
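Two of the simplifications recited in claim 27 — decomposing the input covariance using only its main diagonal (square roots of the diagonal elements) and inverting it with regularization — can be sketched as follows. This is only an illustrative reduction: the eigendecomposition of the target side, the function name, and the mode selection are assumptions, and the SVD-based optimal matrix and energy compensation of the full covariance synthesis are omitted.

```python
import numpy as np

def mixing_matrix(C_target, C_input, eps=1e-9):
    """Sketch of a mixing matrix M such that, for uncorrelated transport
    channels, M @ x has approximately the covariance C_target.

    C_target: (n_out, n_out) target covariance of one frequency bin.
    C_input: (n_in, n_in) input covariance; only its main diagonal is used.
    """
    # input side: square roots of the main diagonal, regularized inverse
    kx_inv = 1.0 / np.maximum(np.sqrt(np.diag(C_input)), eps)
    # target side: factor C_target = Ky @ Ky.T via eigendecomposition
    w, V = np.linalg.eigh(C_target)
    Ky = V @ np.diag(np.sqrt(np.maximum(w, 0.0)))
    # keep the n_in strongest eigen-modes (eigh sorts eigenvalues ascending)
    return Ky[:, -C_input.shape[0]:] * kx_inv
```

With unit-power uncorrelated inputs, applying M reproduces the target covariance exactly, which the test below checks for a diagonal target.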
29. A method of decoding an encoded audio signal, the encoded audio signal comprising one or more transport channels and direction information of a plurality of audio objects, and parameter data for at least two related audio objects for one or more frequency bins of a time frame, wherein the number of the at least two related audio objects is lower than the total number of the plurality of audio objects, the method of decoding comprising:
providing the one or more transport channels in a spectral representation having a plurality of frequency bins in the time frame; and
audio rendering the one or more transport channels into a plurality of audio channels using the direction information,
wherein the audio rendering comprises calculating, for each frequency bin of the one or more frequency bins, a contribution from the one or more transport channels in accordance with first direction information associated with a first related audio object of the at least two related audio objects and in accordance with second direction information associated with a second related audio object of the at least two related audio objects, or taking into account the contribution from the one or more transport channels in accordance with the first direction information associated with the first related audio object and in accordance with the second direction information associated with the second related audio object.

30.
A computer program for performing, when running on a computer or a processor, the method according to claim 28 or the method according to claim 29.

31. An encoded audio signal comprising information on parameter data for at least two related audio objects for one or more frequency bins.

32. The encoded audio signal according to claim 31, further comprising:
one or more encoded transport channels,
as the information on the parameter data, two or more encoded object identifications of the related audio objects for each of one or more of a plurality of frequency bins in a time frame, and one or more encoded combined values or encoded amplitude-related measurement values, and
quantized and encoded direction data for each audio object in the time frame, the direction data being constant for all of the one or more frequency bins.
CN202180076553.3A 2020-10-13 2021-10-12 Apparatus and method for encoding a plurality of audio objects and apparatus and method for decoding using two or more related audio objects Pending CN116529815A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
EP20201633.3 2020-10-13
EP20215651.9 2020-12-18
EP21184367.7 2021-07-07
EP21184367 2021-07-07
PCT/EP2021/078217 WO2022079049A2 (en) 2020-10-13 2021-10-12 Apparatus and method for encoding a plurality of audio objects or apparatus and method for decoding using two or more relevant audio objects

Publications (1)

Publication Number Publication Date
CN116529815A true CN116529815A (en) 2023-08-01

Family

ID=76845013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180076553.3A Pending CN116529815A (en) 2020-10-13 2021-10-12 Apparatus and method for encoding a plurality of audio objects and apparatus and method for decoding using two or more related audio objects

Country Status (1)

Country Link
CN (1) CN116529815A (en)

Similar Documents

Publication Publication Date Title
EP2535892B1 (en) Audio signal decoder, method for decoding an audio signal and computer program using cascaded audio object processing stages
US11252523B2 (en) Multi-channel decorrelator, multi-channel audio decoder, multi-channel audio encoder, methods and computer program using a premix of decorrelator input signals
TWI872420B (en) Apparatus and method for encoding a plurality of audio objects using direction information during a downmixing or apparatus and method for decoding using an optimized covariance synthesis
TWI825492B (en) Apparatus and method for encoding a plurality of audio objects, apparatus and method for decoding using two or more relevant audio objects, computer program and data structure product
CN112074902A (en) Audio scene encoder, audio scene decoder, and related methods using hybrid encoder/decoder spatial analysis
KR20210102924A (en) Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to DirAC-based spatial audio coding using low-, medium- and high-order component generators
Briand et al. Parametric representation of multichannel audio based on principal component analysis
CN116529815A (en) Apparatus and method for encoding a plurality of audio objects and apparatus and method for decoding using two or more related audio objects
RU2823518C1 (en) Apparatus and method for encoding plurality of audio objects or device and method for decoding using two or more relevant audio objects
CN116648931A (en) Apparatus and method for encoding multiple audio objects using direction information during downmixing or decoding using optimized covariance synthesis
RU2826540C1 (en) Device and method for encoding plurality of audio objects using direction information during downmixing or device and method for decoding using optimized covariance synthesis
US20250210048A1 (en) Methods, apparatus and systems for directional audio coding-spatial reconstruction audio processing
CN118871987A (en) Method, device and system for directional audio coding-spatial reconstruction audio processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination