CN116648931A - Apparatus and method for encoding multiple audio objects using direction information during downmixing or decoding using optimized covariance synthesis


Info

Publication number: CN116648931A
Application number: CN202180077244.8A
Authority: CN (China)
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Other languages: Chinese (zh)
Inventor
Andrea Eichenseer
Srikanth Korse
Stefan Bayer
Fabian Küch
Oliver Thiergart
Guillaume Fuchs
Dominik Weckbecker
Jürgen Herre
Markus Multrus
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Original Assignee
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Application filed by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority claimed from PCT/EP2021/078209 (published as WO2022079044A1)
Publication of CN116648931A

Abstract

An apparatus for encoding a plurality of audio objects and associated metadata indicative of directional information about the plurality of audio objects, comprising: a downmixer (400) for downmixing the plurality of audio objects to obtain one or more transmission channels; a transmission channel encoder (300) for encoding one or more transmission channels to obtain one or more encoded transmission channels; and an output interface (200) for outputting an encoded audio signal comprising the one or more encoded transmission channels, wherein the downmixer (400) is configured to downmix a plurality of audio objects in response to direction information about the plurality of audio objects.

Description

Apparatus and method for encoding multiple audio objects using direction information during downmixing or decoding using optimized covariance synthesis
Technical Field
The present invention relates to the encoding of audio signals (e.g., audio objects) and the decoding of encoded audio signals (e.g., encoded audio objects).
Background
Introduction
This document describes a parametric method for encoding and decoding object-based audio content at low bit rates using directional audio coding (DirAC). The presented embodiments are used as part of the 3GPP Immersive Voice and Audio Services (IVAS) codec and provide an advantageous alternative to the low-bit-rate mode of independent streams with metadata (ISM), a discrete coding approach.
Prior Art
Discrete encoding of objects
The most straightforward method of encoding object-based audio content is to encode and transmit each object separately along with its corresponding metadata. The main disadvantage of this approach is that the bit consumption required to encode the objects grows excessively as the number of objects increases. A simple solution to this problem is to employ a "parametric" approach, in which a number of relevant parameters are calculated from the input signals, quantized, and transmitted together with a suitable downmix signal that combines several object waveforms.
Spatial Audio Object Coding (SAOC)
Spatial audio object coding [SAOC_STD, SAOC_AES] is a parametric approach in which an encoder calculates a downmix signal based on a certain downmix matrix D, together with a set of parameters, and transmits both to the decoder. The parameters represent psychoacoustically relevant properties and relationships of all individual objects. At the decoder, the downmix is rendered to a specific loudspeaker layout using a rendering matrix R.
The main parametric information of SAOC is an object covariance matrix E of size N × N, where N is the number of objects. It is transmitted to the decoder in the form of object level differences (OLDs) and, optionally, inter-object covariances (IOCs).

Each element e_i,j of the matrix E is given by:

e_i,j = sqrt(OLD_i · OLD_j) · IOC_i,j

The object level differences (OLDs) are defined as

OLD_i = NRG_i / max_j(NRG_j)

and the absolute object energy (NRG) is described as

NRG_i = ε + Σ_{n∈l} Σ_{k∈m} x_i(n,k) · x_i*(n,k)

where i and j are the indices of the objects x_i and x_j, n denotes a time index and k a frequency index; l denotes a set of time indices and m a set of frequency indices. ε is an additive constant that avoids division by zero, e.g., ε = 10⁻⁹.

A similarity measure of the input objects (IOC) may be given by the cross-correlation, for example:

IOC_i,j = Re{ Σ_{n∈l} Σ_{k∈m} x_i(n,k) · x_j*(n,k) / sqrt(NRG_i · NRG_j) }

The downmix matrix D of size N_dmx × N is defined by elements d_i,j, where i refers to the channel index of the downmix signal and j refers to the object index. For a stereo downmix (N_dmx = 2), d_i,j is calculated from the parameters DMG and DCLD as

d_1,j = 10^(DMG_j/20) · sqrt( 10^(DCLD_j/10) / (1 + 10^(DCLD_j/10)) )
d_2,j = 10^(DMG_j/20) · sqrt( 1 / (1 + 10^(DCLD_j/10)) )

where DMG_j and DCLD_j are given by:

DMG_j = 10 · log10(d_1,j² + d_2,j² + ε)
DCLD_j = 10 · log10(d_1,j² / (d_2,j² + ε))

For the mono downmix case (N_dmx = 1), d_i,j is calculated based on the DMG parameters only as

d_1,j = 10^(DMG_j/20)

where

DMG_j = 10 · log10(d_1,j² + ε)
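As a minimal, non-normative sketch of these relations (the array shapes and the value of ε are assumptions for illustration, not the standard's normative code), the parameters could be computed as follows:

```python
import numpy as np

EPS = 1e-9  # additive constant avoiding division by zero (assumed value)

def saoc_parameters(x):
    """Compute OLD, IOC and the object covariance matrix E for one
    parameter tile (l, m). x: complex object spectra, shape (N, slots, bands)."""
    # absolute object energies: NRG_i = eps + sum_n sum_k x_i x_i*
    nrg = EPS + np.sum(np.abs(x) ** 2, axis=(1, 2))           # (N,)
    # object level differences: OLD_i = NRG_i / max_j NRG_j
    old = nrg / np.max(nrg)                                    # (N,)
    # inter-object coherence: IOC_ij = Re{ sum x_i x_j* / sqrt(NRG_i NRG_j) }
    cross = np.einsum('ink,jnk->ij', x, np.conj(x))            # (N, N)
    ioc = np.real(cross / np.sqrt(np.outer(nrg, nrg)))
    # object covariance matrix: e_ij = sqrt(OLD_i OLD_j) * IOC_ij
    e = np.sqrt(np.outer(old, old)) * ioc
    return old, ioc, e

def stereo_downmix_gains(dmg_db, dcld_db):
    """Recover d_1,j and d_2,j from DMG/DCLD given in dB (stereo case)."""
    g = 10.0 ** (dmg_db / 20.0)
    r = 10.0 ** (dcld_db / 10.0)
    return g * np.sqrt(r / (1.0 + r)), g * np.sqrt(1.0 / (1.0 + r))
```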
spatial audio object coding-3D (SAOC-3D)
Spatial Audio Object Coding for 3D audio (SAOC-3D) [MPEGH_AES, MPEGH_IEEE, MPEGH_STD, SAOC_3D_PAT] is an extension of the MPEG SAOC technology described above, which compresses and renders both channel and object signals in a bitrate-efficient manner.
The main differences with SAOC are:
although the original SAOC only supports up to two downmix channels, SAOC-3D may map multi-object inputs to any number of downmix channels (and associated side information).
It renders directly to the multi-channel output, whereas classical SAOC used MPEG Surround as a multi-channel output processor.
Some tools, such as residual coding tools, are discarded.
Despite these differences, from a parametric point of view SAOC-3D is identical to SAOC. The SAOC-3D decoder, similar to the SAOC decoder, receives a multi-channel downmix X, a covariance matrix E, a rendering matrix R and a downmix matrix D.
The rendering matrix R is defined by input channels and input objects and received from a format converter (channel) and an object renderer (object), respectively.
The downmix matrix D is defined by elements d_i,j, where i refers to the channel index of the downmix signal and j refers to the object index, and is calculated from the downmix gains (DMGs):

d_i,j = 10^(DMG_i,j/20)

where

DMG_i,j = 10 · log10(d_i,j² + ε)

The output covariance matrix C of size N_out × N_out is defined as:

C = R E R*
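A one-function sketch of this relation (shapes as annotated; R and E assumed given):

```python
import numpy as np

def output_covariance(R, E):
    """C = R E R*: covariance of the rendered output.
    R: (N_out, N) rendering matrix, E: (N, N) object covariance matrix."""
    return R @ E @ R.conj().T
```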
Related schemes
There are several other schemes that are similar in nature to SAOC but differ in certain details:
binaural Cue Coding (BCC) for objects has been described in, for example, [ BCC2001], and is a precursor to SAOC technology.
Joint Object Coding (JOC) and Advanced Joint Object Coding (A-JOC) perform a similar function to SAOC, while delivering essentially separated objects at the decoder side without rendering them to a specific output loudspeaker layout [JOC_AES, AC4_AES]. These techniques transmit elements of the upmix matrix from the downmix to the separated objects as parameters (instead of the OLDs).
Directional audio coding (DirAC)
Another parametric approach is directional audio coding. DirAC [Pulkki2009] is a perceptually motivated reproduction of spatial sound. It assumes that, at one time instant and for one critical band, the spatial resolution of the human auditory system is limited to decoding one cue for direction and another for inter-aural coherence.
Based on these assumptions, DirAC represents the spatial sound in one frequency band by cross-fading two streams: a non-directional diffuse stream and a directional non-diffuse stream. DirAC processing is performed in two phases, analysis and synthesis, as shown in figs. 12a and 12b.
In the DirAC analysis stage, a first-order coincident microphone signal in B-format is taken as input, and the diffuseness and direction of arrival of the sound are analyzed in the frequency domain.
In the DirAC synthesis stage, the sound is divided into two streams, a non-diffuse stream and a diffuse stream. The non-diffuse stream is rendered as point sources using amplitude panning, which can be done using vector base amplitude panning (VBAP) [Pulkki1997]. The diffuse stream is responsible for the sensation of envelopment and is produced by delivering mutually decorrelated signals to the loudspeakers.
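For illustration, a minimal two-dimensional VBAP for a single loudspeaker pair can be sketched as follows; the loudspeaker angles are assumptions, and a full VBAP implementation additionally selects the active pair or triplet from the layout:

```python
import numpy as np

def vbap_2d(source_deg, spk_deg=(30.0, -30.0)):
    """Minimal 2-D VBAP for one loudspeaker pair: solve L g = p for the
    gains g, then power-normalize. Angles are azimuths in degrees."""
    p = np.array([np.cos(np.radians(source_deg)), np.sin(np.radians(source_deg))])
    # columns of L are the unit vectors pointing at the two loudspeakers
    L = np.column_stack([(np.cos(np.radians(a)), np.sin(np.radians(a))) for a in spk_deg])
    g = np.linalg.solve(L, p)
    return g / np.linalg.norm(g)
```

For a source at 30 degrees this returns gains close to [1, 0], i.e., only the left speaker of the assumed ±30 degree pair is active.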
The analysis stage in fig. 12a comprises a band filter 1000, an energy estimator 1001, an intensity estimator 1002, time averaging elements 999a and 999b, a diffuseness calculator 1003 and a direction calculator 1004. The calculated spatial parameters are a diffuseness value between 0 and 1 for each time/frequency region and a direction-of-arrival parameter for each time/frequency region generated by block 1004. In fig. 12a, the direction parameters comprise an azimuth angle and an elevation angle indicating the direction of arrival of the sound with respect to a reference or listening position, and in particular with respect to the position of the microphone from which the four component signals input into the band filter 1000 are collected. In fig. 12a, these component signals are first-order Ambisonics components comprising an omnidirectional component W, a directional component X, another directional component Y and a further directional component Z.
The DirAC synthesis stage shown in fig. 12B comprises a bandpass filter 1005 for generating a time/frequency representation of the B-format microphone signal W, X, Y, Z. The corresponding signals for the respective time/frequency regions are input to a virtual microphone stage 1006, which virtual microphone stage 1006 generates a virtual microphone signal for each channel. Specifically, in order to generate a virtual microphone signal, e.g. of a central channel, the virtual microphone is directed in the direction of the central channel, and the resulting signal is the corresponding component signal of the central channel. The signal is then processed via direct signal branch 1015 and diffuse signal branch 1014. Both branches comprise corresponding gain adjusters or amplifiers, which are controlled in blocks 1007, 1008 by a diffuseness value derived from the original diffuseness parameter and are further processed in blocks 1009, 1010 to obtain a certain microphone compensation.
The component signals in the direct signal branch 1015 are also gain adjusted using gain parameters derived from direction parameters consisting of azimuth and elevation. Specifically, these angles are input into a VBAP (vector base amplitude panning) gain table 1011. For each channel, the result is input to a speaker gain averaging stage 1012 and to a further normalizer 1013, and the resulting gain parameters are forwarded to an amplifier or gain adjuster in the direct signal branch 1015. In the combiner 1017 the diffuse signal generated at the output of the decorrelator 1016 and the direct signal or the non-diffuse stream are combined and then the other sub-bands are added in a further combiner 1018, which further combiner 1018 may be, for example, a synthesis filter bank. Thus, a speaker signal for a certain speaker is generated, and the same process is performed for other channels of other speakers 1019 in a certain speaker setting.
The high quality version of DirAC synthesis is shown in fig. 12b, where the synthesizer receives all B-format signals from which the virtual microphone signal is calculated for each loudspeaker direction. The pattern used is typically a dipole. The virtual microphone signal is then modified in a nonlinear manner depending on the metadata discussed with respect to branches 1016 and 1015. The low bit rate version of DirAC is not shown in fig. 12b. However, in this low bit rate version, a single audio channel is sent. The processing is different in that: all virtual microphone signals will be replaced by a single audio channel received. The virtual microphone signal is divided into two streams, a diffuse stream and a non-diffuse stream, which are processed separately. Non-diffuse sound is rendered as a point source by using Vector Base Amplitude Panning (VBAP). In panning, the mono sound signal is applied to the subset of speakers after being multiplied by the speaker-specific gain factor. The gain factor is calculated using information of the speaker settings and the specified panning direction. In the low bit rate version, the input signal is simply panned to the direction implied by the metadata. In the high quality version, each virtual microphone signal is multiplied by a corresponding gain factor, which produces the same effect as panning, but it is less prone to any nonlinear artifacts.
The purpose of diffuse sound synthesis is to create a sound perception around the listener. In the low bit rate version, the diffuse stream is reproduced by de-correlating the input signal and reproducing it from each speaker. In the high quality version, the virtual microphone signals of the diffuse streams are already to some extent incoherent and only need to be de-correlated slightly.
DirAC parameters (also known as spatial metadata) consist of tuples of diffuseness and direction, which in spherical coordinates is represented by two angles (i.e. azimuth and elevation). If both the analysis stage and the synthesis stage are running on the decoder side, the time-frequency resolution of the DirAC parameters may be chosen to be the same as the filter bank used for DirAC analysis and synthesis, i.e. a different set of parameters for each time slot and frequency interval of the filter bank representation of the audio signal.
Some work has been done to reduce the size of metadata to enable the DirAC paradigm to be used for spatial audio coding and teleconferencing scenarios [ Hirvonen2009].
In WO2019068638 a universal spatial audio coding system based on DirAC is described. In contrast to classical DirAC, which was designed for B-format (first-order Ambisonics) input, the system can accept first-order or higher-order Ambisonics, multi-channel or object-based audio input, and also allows mixed input signal types. All signal types are encoded and transmitted efficiently, either individually or in combination; the former combines the different representations at the renderer (decoder side), while the latter combines the different audio representations in the DirAC domain at the encoder side.
Compatibility with DirAC framework
This embodiment builds on the unified framework for arbitrary input types proposed in [WO2019068638] and, similar to what was done for multi-channel content in [WO2020249815], aims to eliminate the problem that the DirAC parameters (direction and diffuseness) cannot be applied efficiently to object input. In fact, the diffuseness parameter is not needed at all, but it was found that a single direction cue per time/frequency unit is not sufficient to reproduce object content at high quality. This embodiment therefore proposes to employ multiple direction cues per time/frequency unit and, accordingly, introduces an adapted parameter set that replaces the classical DirAC parameters in the case of object input.
Flexible system for low bit rate
In contrast to DirAC, which uses a scene-based representation from the perspective of the listener, SAOC and SAOC-3D are designed for channel- and object-based content, where the parameters describe the relationships between the channels/objects. In order to use a scene-based representation for object input, and thus be compatible with the DirAC renderer, while ensuring an efficient representation and high-quality rendering, an adapted parameter set is required that also allows signaling multiple direction cues.
An important objective of this embodiment is to find a method for efficiently encoding object input at low bit rates, with good scalability towards larger numbers of objects. Discrete encoding of each object signal does not provide this scalability: each additional object results in a significant increase of the overall bit rate. If the required bit consumption exceeds the allowed bit rate, the result is a very significant degradation of the output signal; this degradation is another argument in favor of this embodiment.
It is an object of the present invention to provide an improved concept for encoding a plurality of audio objects or decoding an encoded audio signal.
This object is achieved by an apparatus for encoding according to claim 1, a decoder according to claim 18, an encoding method according to claim 28, a decoding method according to claim 29, a computer program according to claim 30 or an encoded audio signal according to claim 31.
In one aspect of the invention, the invention is based on the following findings: for one or more of the plurality of frequency bins, at least two related audio objects are defined and parameter data related to the at least two related objects is included at the encoder side and used at the decoder side to obtain a high quality and efficient audio encoding/decoding concept.
According to another aspect of the invention, the invention is based on the finding that a specific downmix adapted to the direction information associated with each object should be performed, such that the direction information associated with an object, which is valid for the entire object (i.e., for all frequency bins in a time frame), is used when downmixing the object into the plurality of transmission channels. The use of the direction information is, for example, equivalent to generating the transmission channels as virtual microphone signals with certain adjustable characteristics.
On the decoder side, a specific synthesis is performed which relies on covariance synthesis; a specific embodiment is a high-quality covariance synthesis that is not affected by artifacts introduced by decorrelators. In other embodiments, an advanced covariance synthesis is used that relies on specific improvements over standard covariance synthesis in order to improve the audio quality and/or to reduce the amount of computation required for calculating the mixing matrix used within the covariance synthesis.
However, even in a more classical synthesis of the audio rendering, in which the individual contributions within a time/frequency interval are determined explicitly based on the transmitted selection information, the audio quality is superior to prior-art object coding or channel downmix methods. In this case, each time/frequency interval has object identification information, and when the audio rendering is performed, i.e., when the direction contribution of each object is accounted for, the object identification is used to look up the direction associated with the object in order to determine the gain values of the respective output channels for each time/frequency interval. Thus, when there is only a single relevant object in a time/frequency interval, only the gain values of that single object are determined for this time/frequency interval, based on the object ID and the "codebook" of direction information of the associated objects.
However, when there is more than one relevant object in a time/frequency interval, the gain values of each relevant object are calculated so as to distribute the corresponding time/frequency interval of the transmission channels into corresponding output channels controlled by the output format provided by the user (e.g., a certain channel format such as a stereo format, a 5.1 format, etc.). Whether the gain values are used for covariance synthesis purposes (i.e., for the purpose of applying a mixing matrix that mixes the transmission channels into the output channels), or whether the gain values are used to explicitly determine the individual contribution of each object in a time/frequency interval by multiplying the gain values by the corresponding time/frequency interval of the one or more transmission channels and then summing the contributions for each output channel in the corresponding time/frequency interval (possibly enhanced by adding diffuse signal components), the output audio quality is enhanced due to the flexibility provided by determining one or more relevant objects for each frequency interval.
Such a determination can be performed very efficiently, because only the one or more object IDs per time/frequency interval have to be encoded and transmitted to the decoder, together with the direction information of each object. The latter is particularly compact because, for a frame, only a single piece of direction information exists for all frequency bins.
Thus, an efficient and high-quality object downmix is obtained, irrespective of whether the synthesis is performed using a (preferably enhanced) covariance synthesis or using a combination of explicit transmission channel contributions per object. Both are preferably enhanced by a specific object-direction-dependent downmix that relies on downmix weights reflecting the generation of the transmission channels as virtual microphone signals.
Aspects related to two or more related objects per time/frequency interval may preferably be combined with aspects performing a specific direction-dependent downmixing of objects to the transmission channel. However, these two aspects can also be applied independently of each other. Furthermore, while in some embodiments covariance synthesis of two or more correlated objects per time/frequency interval is performed, advanced covariance synthesis and advanced transmission channel-to-output channel upmixing may also be performed by transmitting only a single object identification per time/frequency interval.
Furthermore, irrespective of whether there is a single related object or several related objects per time/frequency interval, the upmixing may be performed by calculating a mixing matrix in a standard or enhanced covariance synthesis, or the upmixing may be performed by determining the contributions for the time/frequency intervals individually (based on the object identification used to obtain the specific direction information from the direction "codebook") to determine the gain values of the corresponding contributions. In the case of two or more related objects per time/frequency interval, these contributions are then summed to obtain a complete contribution for each time/frequency interval, as shown in the sketch below. The output of this summing step is then identical to the output of the mixing matrix application, and a final filter bank processing is performed to generate the time-domain output channel signals in the corresponding output format.
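To make the explicit-contribution route concrete, the following non-normative sketch sums per-object contributions for one time/frequency tile; the names and the square-root scaling of the power ratios are assumptions, not the codec's specified behavior:

```python
import numpy as np

def render_tile(x_tile, gains, power_ratios):
    """Sum explicit per-object contributions for one (k, n) tile.
    x_tile: transmission-channel value(s) already combined to one scalar;
    gains: list of panning-gain vectors, one (n_out,) array per relevant object;
    power_ratios: list of the corresponding power ratios (summing to 1)."""
    out = np.zeros(gains[0].shape, dtype=complex)
    for g_i, pr_i in zip(gains, power_ratios):
        # sqrt(pr) assumes the ratio weights power, so amplitude gets its root
        out += np.sqrt(pr_i) * g_i * x_tile
    return out
```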
Drawings
Preferred embodiments of the present invention are described hereinafter with reference to the accompanying drawings, in which:
FIG. 1a is an implementation of an audio encoder according to the first aspect with at least two related objects per time/frequency interval;
FIG. 1b is an implementation of an encoder according to a second aspect with direction dependent object downmixing;
FIG. 2 is a preferred implementation of an encoder according to the second aspect;
FIG. 3 is a preferred implementation of an encoder according to the first aspect;
FIG. 4 is a preferred implementation of the decoder according to the first and second aspects;
FIG. 5 is a preferred implementation of the covariance synthesis process of FIG. 4;
FIG. 6a is an implementation of a decoder according to the first aspect;
fig. 6b is a decoder according to a second aspect;
fig. 7a is a flow chart for illustrating the determination of parameter information according to the first aspect;
FIG. 7b is a preferred implementation of further determining parametric data;
FIG. 8a shows a high resolution filter bank time/frequency representation;
fig. 8b shows the transmission of relevant side information for frame J according to a preferred implementation of the first and second aspects;
FIG. 8c shows a "directional codebook" included in an encoded audio signal;
FIG. 9a shows a preferred encoding scheme according to the second aspect;
FIG. 9b shows an implementation of static downmixing according to the second aspect;
FIG. 9c shows an implementation of dynamic downmixing according to the second aspect;
FIG. 9d shows another embodiment of the second aspect;
fig. 10a shows a flow chart of a preferred implementation of the decoder side of the first aspect;
FIG. 10b illustrates a preferred implementation of the output channel calculation of FIG. 10a according to an embodiment in which the contributions of each output channel are summed;
fig. 10c shows a preferred manner of determining power values for a plurality of related objects according to the first aspect;
FIG. 10d illustrates an embodiment of computing the output channels of FIG. 10a using covariance synthesis depending on the computation and application of a mixing matrix;
FIG. 11 illustrates several embodiments for advanced computation of a mixing matrix for time/frequency bins;
FIG. 12a shows a prior art DirAC encoder; and
fig. 12b shows a prior art DirAC decoder.
Detailed Description
Fig. 1a shows an apparatus for encoding a plurality of audio objects, which receives the audio objects as such and/or metadata of the audio objects at an input. The encoder comprises an object parameter calculator 100 which provides parameter data of at least two related audio objects for a time/frequency interval and forwards the data to an output interface 200. Specifically, the object parameter calculator calculates parameter data of at least two related audio objects for one or more of a plurality of frequency bins related to a time frame, where the number of the at least two related audio objects is lower than the total number of the plurality of audio objects. Thus, the object parameter calculator 100 actually performs a selection, rather than merely indicating that all objects are relevant. In a preferred embodiment, the selection is made based on relevance, and the relevance is determined by an amplitude-related measure (e.g., amplitude, power, loudness, or another measure obtained by raising the amplitude to a power different from 1 and preferably greater than 1). Then, if a certain number of related objects is to be selected for a time/frequency interval, the objects having the most relevant characteristics (i.e., the objects having the highest power among all objects) are selected, and data on these selected objects is included in the parameter data.
The output interface 200 is configured to output an encoded audio signal comprising information about parametric data of at least two related audio objects of one or more frequency bins. Depending on the implementation, the output interface may receive other data (e.g., object downmixes, or one or more transmission channels representing object downmixes, or additional parameters, or object waveform data in the form of a mixed representation of several objects being downmixed, or other objects in the form of separate representations) and input it into the encoded audio signal. In this case, the object is directly introduced or "copied" into the corresponding transmission channel.
Fig. 1b shows a preferred implementation of the apparatus for encoding a plurality of audio objects according to the second aspect, in which audio objects and related object metadata are received, the object metadata indicating direction information on the plurality of audio objects, i.e., one piece of direction information per object, or per group of objects if a group of objects has the same direction information associated with it. The audio objects are input into a downmixer 400 for downmixing the plurality of audio objects to obtain one or more transmission channels. Further, a transmission channel encoder 300 is provided, which encodes the one or more transmission channels to obtain one or more encoded transmission channels, which are then input into the output interface 200. Specifically, the downmixer 400 is connected to an object direction information provider 110, which receives at its input any data from which the object metadata can be derived, and outputs the direction information actually used by the downmixer 400. The direction information forwarded from the object direction information provider 110 to the downmixer 400 is preferably dequantized direction information, i.e., the same direction information that is also available at the decoder side. To this end, the object direction information provider 110 is configured to derive or extract or obtain non-quantized object metadata, which is then quantized to obtain quantized object metadata representing quantization indices, included among the "other data" shown in fig. 1b that in a preferred embodiment is provided to the output interface 200. Further, the object direction information provider 110 is configured to dequantize the quantized object direction information to obtain the actual direction information forwarded from block 110 to the downmixer 400.
Preferably, the output interface 200 is configured to additionally receive parameter data of the audio objects, object waveform data, one or several identifications of the single or multiple relevant objects per time/frequency interval, and the quantized direction data discussed before.
Subsequently, further embodiments are presented. A parametric approach for coding audio object signals is proposed, which allows efficient transmission at low bit rates and high-quality reproduction at the consumer side. Based on the DirAC principle of considering one direction cue per critical band and time instance (time/frequency region), the most dominant object is determined for each such time/frequency region of the time/frequency representation of the input signals. Since this proves insufficient for object input, an additional, second-most-dominant object is determined per time/frequency region, and based on both objects, power ratios are calculated that determine the influence of each of the two objects on the considered time/frequency region. Note that it is also conceivable to consider more than two most dominant objects per time/frequency unit, in particular for a growing number of input objects. For simplicity, the following description is mainly based on two dominant objects per time/frequency unit.
Thus, the parametric side information sent to the decoder comprises:
power ratio calculated for a subset of related (primary) objects per time/frequency zone (or parameter band).
Object index representing a subset of related objects for each time/frequency zone (or parameter band).
Directional information associated with the object index and provided for each frame (where each time domain frame includes a plurality of parameter bands, and each parameter band includes a plurality of time/frequency regions).
The direction information may be obtained via an input metadata file associated with the audio object signals; for example, the metadata may be specified on a frame basis. In addition to the side information, a downmix signal combining the input object signals is transmitted to the decoder.
During the rendering phase, the transmitted direction information (retrieved via the object indices) is used to pan the transmitted downmix signals (or, more generally: the transmission channels) to the appropriate directions. The downmix signal is distributed to the two relevant object directions based on the transmitted power ratios, which serve as weighting factors. This processing is done for each time/frequency region of the time/frequency representation of the decoded downmix signal.
This section summarizes the encoder-side processing and then details the parameter and downmix calculations. An audio encoder receives one or more audio object signals. Each audio object signal has an associated metadata file describing the object properties. In this embodiment, the object properties described in the associated metadata file correspond to direction information provided on a frame basis, where one frame corresponds to 20 milliseconds. Each frame is identified by a frame number, which is also contained in the metadata file. The direction information is given in the form of azimuth and elevation values, where the azimuth takes values from (-180, 180) degrees and the elevation takes values from [-90, 90] degrees.
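A per-object, per-frame metadata record of this kind could be modeled as below; the field names are illustrative assumptions, not the actual metadata file syntax:

```python
from dataclasses import dataclass

@dataclass
class ObjectFrameMetadata:
    frame_number: int     # one frame corresponds to 20 ms
    azimuth_deg: float    # within (-180, 180) degrees
    elevation_deg: float  # within [-90, 90] degrees
```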
The information provided in the metadata file is used together with the actual audio object file to create a parameter set that is sent to the decoder and used to render the final audio output file. More specifically, the encoder estimates parameters of the main object subset, i.e., the power ratio, for each given time/frequency region. The main object subset is represented by an object index, which is also used to identify the object direction. These parameters are sent to the decoder along with the transmission channel and direction metadata.
Fig. 2 gives an overview of an encoder in which the transmission channels comprise a downmix signal calculated from the input object files and the direction information provided in the input metadata. The number of transmission channels is always smaller than the number of input object files. In the encoder of the embodiment, the encoded audio signal is represented by the encoded transmission channels, and the encoded parametric side information is represented by the encoded object indices, the encoded power ratios and the encoded direction information. The encoded transmission channels and the encoded parametric side information together form a bitstream that is output by the multiplexer 220. Specifically, the encoder comprises a filter bank 102 that receives the input object audio files. Additionally, the object metadata files are provided to an extract direction information block 110a. The output of block 110a is input into a quantize direction information block 110b, which outputs the direction information to the downmixer 400 performing the downmix calculation. Furthermore, the quantized direction information (i.e., the quantization indices) is forwarded from block 110b to an encode direction information block 202, which preferably performs some entropy coding in order to further reduce the required bit rate.
Further, the output of the filter bank 102 is input into a signal power calculation block 104, and the output of the signal power calculation block 104 is input into an object selection block 106 and, additionally, into a power ratio calculation block 108. The power ratio calculation block 108 is also connected to the object selection block 106 so that the power ratios (i.e., the combined values) are calculated only for the selected objects. In block 210, the calculated power ratios or combined values are quantized and encoded. As will be outlined later, power ratios are preferred because they save the transmission of one power data item. However, in other embodiments where such a saving is not required, the actual signal powers, or other values derived from the signal powers determined by block 104 for the objects selected by the object selector 106, may be input into the quantizer and encoder instead of the power ratios. In that case, the power ratio calculation 108 is not required, and the object selection 106 ensures that only the relevant parametric data (i.e., the power-related data of the relevant objects) is input into block 210 for quantization and encoding purposes.
Comparing fig. 1a with fig. 2, blocks 102, 104, 110a, 110b, 106, 108 are preferably included in the object parameter calculator 100 of fig. 1a, and blocks 202, 210, 220 are preferably included in the output interface block 200 of fig. 1 a.
Further, the core encoder 300 in fig. 2 corresponds to the transmission channel encoder 300 of fig. 1b, the downmix calculation block 400 corresponds to the downmix 400 of fig. 1b, and the object direction information provider 110 of fig. 1b corresponds to the blocks 110a, 110b of fig. 2. Furthermore, the output interface 200 of fig. 1b is preferably implemented in the same way as the output interface 200 of fig. 1a and comprises the blocks 202, 210, 220 of fig. 2.
Fig. 3 shows an encoder variant in which the downmix calculation is optional and independent of the input metadata. In this variant, the input audio files may be fed directly into the core encoder, which creates transmission channels from them, so that the number of transmission channels corresponds to the number of input object files; this is particularly interesting if the number of input objects is 1 or 2. For a large number of objects, the downmix signal will still be used to reduce the amount of data to be transmitted.
In fig. 3, like reference numerals refer to like functions of fig. 2. This is valid not only for fig. 2 and 3, but also for all other figures described in this specification. Unlike fig. 2, fig. 3 performs the downmix calculation 400 without any direction information. Thus, the downmix calculation may be a static downmix, e.g. using a previously known downmix matrix, or may be an energy-dependent downmix that does not depend on any directional information associated with the objects included in the input object audio file. However, the direction information is extracted in block 110a and quantized in block 110b, and the quantized values are forwarded to the direction information encoder 202 for having encoded direction information in an encoded audio signal (e.g., a binary encoded audio signal) forming a bitstream.
The downmix calculation block 400 may also be omitted when there are only a few input audio object files, or when sufficient transmission bandwidth is available, so that the input audio object files directly represent the transmission channels encoded by the core encoder. In such an implementation, blocks 104, 106, 108, 210 are not required either. However, a preferred implementation is a hybrid one, in which some objects are introduced directly into the transmission channels while other objects are downmixed into one or more transmission channels. In that case, all of the blocks shown in fig. 3 are necessary to generate a bitstream having one or more objects directly within the encoded transmission channels together with one or more transmission channels generated by the downmixer 400 of fig. 2 or 3.
Parameter calculation
The time-domain audio signals comprising all input object signals are converted into the time/frequency domain using a filter bank. For example, the CLDFB (complex low delay filter bank) analysis filter converts 20-millisecond frames (corresponding to 960 samples at a sampling rate of 48 kHz) into time/frequency regions of size 16 × 60, with 16 time slots and 60 frequency bands. For each time/frequency unit, the instantaneous signal power is calculated as:

P_i(k, n) = |X_i(k, n)|²
where k denotes the band index, n the slot index, and i the object index. Since transmitting parameters for each time/frequency region is very costly in terms of the final bit rate, grouping is employed so that parameters are calculated for a reduced number of time/frequency regions. For example, the 16 slots may be combined into a single slot, and the 60 frequency bands may be grouped into 11 bands following a psychoacoustic scale. This reduces the initial size of 16 × 60 to 1 × 11, corresponding to 11 so-called parameter bands. The instantaneous signal power values are summed per group to obtain a signal power of reduced dimension:

P_i(l, m) = Σ_{n=0..T} Σ_{k=B_S..B_E} P_i(k, n)

where T corresponds to 15 in this example, and B_S and B_E define the parameter band boundaries.
In order to determine the subset of the most dominant objects for which parameters are calculated, the (grouped) signal power values of all N input audio objects are sorted in descending order. In this embodiment, the two most dominant objects are determined, and the corresponding object indices, ranging from 0 to N-1, are stored as part of the parameters to be transmitted. Furthermore, a power ratio is calculated that relates the two dominant object signals to each other:

PR_1(l, m) = P_1(l, m) / (P_1(l, m) + P_2(l, m))

or, in a more general expression that is not limited to two objects:

PR_i(l, m) = P_i(l, m) / Σ_{s=1..S} P_s(l, m)

where S denotes the number of dominant objects to be considered, and:

Σ_{i=1..S} PR_i(l, m) = 1

In the case of two dominant objects, a power ratio of 0.5 means that both objects are equally present within the corresponding parameter band, while power ratios of 1 and 0 mean that one of the two objects is absent. These power ratios are stored as the second part of the parameters to be transmitted. Since the power ratios sum to 1, it is sufficient to transmit S-1 values instead of S.
In addition to the object indices and the power ratio values for each parameter band, the direction information of each object, extracted from the input metadata file, must be transmitted. Since this information is provided on a frame basis, this is done once per frame (where each frame comprises 11 parameter bands, or a total of 16 × 60 time/frequency regions, in the described example). The object indices thus indirectly represent the object directions. Note that since the sum of the power ratios is 1, the number of power ratios to be transmitted per parameter band can be reduced by 1; for example, in the case of 2 relevant objects, it is sufficient to transmit 1 power ratio.
Both the direction information and the power ratios are quantized and combined with the object indices to form the parametric side information. This parametric side information is then encoded and merged with the encoded transmission channels/downmix signal into the final bitstream representation. For example, quantizing the power ratios with 3 bits per value achieves a good trade-off between output quality and spent bit rate. As a practical example, the direction information may be provided with an angular resolution of 5 degrees and then quantized with 7 bits per azimuth value and 6 bits per elevation value.
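A sketch of the dominant-object selection and the quantization steps; the 3-bit power ratio and the 5-degree grid follow the examples in the text, while the exact index layouts and codebooks are assumptions:

```python
import numpy as np

def select_and_quantize(p_band, n_dominant=2):
    """p_band: grouped power per object for one parameter band, shape (N,).
    Returns the indices of the dominant objects and one quantized power
    ratio (for S = 2, transmitting S - 1 = 1 value suffices)."""
    idx = np.argsort(p_band)[::-1][:n_dominant]       # most dominant objects
    pr = p_band[idx] / (p_band[idx].sum() + 1e-12)    # power ratios, sum to 1
    q = int(np.round(pr[0] * 7))                      # 3-bit uniform quantizer
    return idx, q

def quantize_direction(azimuth_deg, elevation_deg, step=5.0):
    """5-degree grid: 72 azimuth values fit into 7 bits, 37 elevation
    values into 6 bits (index conventions are assumptions)."""
    az_idx = int(np.round((azimuth_deg + 180.0) / step)) % 72
    el_idx = int(np.round((elevation_deg + 90.0) / step))   # 0..36
    return az_idx, el_idx
```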
Down-mixing calculation
All input audio object signals are combined into a downmix signal comprising one or more transmission channels, wherein the number of transmission channels is smaller than the number of input object signals. Note that: in this embodiment, if there is only one input object, only a single transmission channel occurs, which thus means that the downmix calculation is skipped.
If the downmix comprises two transmission channels, a stereo downmix may, for example, be calculated as virtual cardioid microphone signals. The virtual cardioid microphone signals are determined by applying, for each frame, the direction information provided in the metadata file (here, all elevation values are assumed to be zero):

w_L = 0.5 + 0.5 · cos(azimuth - π/2)
w_R = 0.5 + 0.5 · cos(azimuth + π/2)

Here, the virtual cardioids are located at 90° and -90°. Individual weights are thus determined for each of the two transmission channels (left and right) and applied to the corresponding audio object signals:

X_L(t) = Σ_{i=1..N} w_L,i · x_i(t)
X_R(t) = Σ_{i=1..N} w_R,i · x_i(t)

where N ≥ 2 is the number of input objects. If the virtual cardioid weights are updated for each frame, a dynamic downmix adapted to the direction information is obtained. Another possibility is a fixed downmix, where each object is assumed to be at a static position. The static position may, for example, correspond to the initial direction of the object, which then results in static virtual cardioid weights that are the same for all frames.
More than two transmission channels are conceivable if the target bit rate allows. In the case of three transmission channels, the cardioids may be arranged uniformly at, for example, 0°, 120° and -120°. If four transmission channels are used, the fourth cardioid may point upwards, or the four cardioids may again be arranged uniformly in the horizontal plane. If the object positions cover, for example, only part of one hemisphere, the arrangement may also be tailored to the object positions. The resulting downmix signal is processed by the core encoder and converted into a bitstream representation together with the encoded parametric side information. A sketch of the stereo case follows.
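The following non-normative sketch implements the stereo case with per-frame weights (dynamic downmix); function and variable names are assumptions:

```python
import numpy as np

def cardioid_downmix(objs, azimuth_deg):
    """objs: (N, n_samples) object signals for one frame;
    azimuth_deg: (N,) per-object azimuths for this frame
    (elevations assumed zero, as in the text)."""
    az = np.radians(np.asarray(azimuth_deg))
    w_l = 0.5 + 0.5 * np.cos(az - np.pi / 2)   # virtual cardioid at +90 degrees
    w_r = 0.5 + 0.5 * np.cos(az + np.pi / 2)   # virtual cardioid at -90 degrees
    return np.stack([w_l @ objs, w_r @ objs])  # (2, n_samples) stereo downmix
```

An object at azimuth 90° gets weight 1 in the left channel and 0 in the right, so the virtual cardioids behave as directional microphones pointing left and right.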
Alternatively, the input object signals may be fed into the core encoder without being combined into a downmix signal. In this case, the number of the resulting transmission channels corresponds to the number of the input object signals. Typically, the maximum number of transmission channels is given in relation to the total bit rate. Then, the downmix signal is employed only when the number of input object signals exceeds the maximum number of transmission channels.
Fig. 6a shows a decoder for decoding an encoded audio signal (e.g. the signal output by fig. 1a or fig. 2 or fig. 3) comprising direction information of a plurality of audio objects and one or more transmission channels. Furthermore, for one or more frequency bins of the time frame, the encoded audio signal comprises parameter data of at least two related audio objects, wherein the number of at least two related objects is lower than the total number of the plurality of audio objects. In particular, the decoder comprises an input interface for providing one or more transmission channels in the form of a spectral representation having a plurality of frequency bins in a time frame. This represents the signal forwarded from the input interface block 600 to the audio renderer block 700. In particular, the audio renderer 700 is configured to render one or more transmission channels into a plurality of audio channels using directional information included in the encoded audio signal, the number of audio channels preferably being two channels for a stereo output format or preferably being more than two channels for a larger number of output formats (e.g. 3 channels, 5 channels, 5.1 channels, etc.). Specifically, the audio renderer 700 is configured to: for each of the one or more frequency bins, a contribution of the one or more transmission channels is calculated from first direction information associated with a first of the at least two related audio objects and from second direction information associated with a second of the at least two related audio objects. Specifically, the direction information of the plurality of audio objects includes first direction information associated with the first object and second direction information associated with the second object.
Fig. 8b shows the parameter data for one frame, which in a preferred embodiment comprise direction information 810 for the plurality of audio objects, and further comprise power ratios for each of a number of parameter bands, shown at 812, and one (preferably two or even more) object indices for each parameter band, indicated at block 814. In particular, the direction information for the plurality of audio objects 810 is shown in more detail in fig. 8c. Fig. 8c shows a table whose first column contains the object IDs from 1 to N, where N is the number of audio objects. Furthermore, a second column holds the direction information of each object, preferably as an azimuth value and an elevation value or, in the two-dimensional case, as an azimuth value only. This is shown at 818. Thus, fig. 8c shows the "directional codebook" included in the encoded audio signal, which is input into the input interface 600 of fig. 6a. The direction information from column 818 is uniquely associated with a certain object ID from column 816 and is valid for the "entire" object in the frame (i.e., for all frequency bands in the frame). Thus, only a single piece of direction information is transmitted and used per object identification, irrespective of whether the frequency bins are the time/frequency bins of the high-resolution representation or the time/parameter bands of the lower-resolution representation.
In this context, fig. 8a shows the time/frequency representation generated by the filter bank 102 of fig. 2 or 3 when this filter bank is implemented as the CLDFB (complex low delay filter bank) discussed previously. For a frame for which direction information is given as discussed with respect to figs. 8b and 8c, the filter bank generates 16 time slots ranging from 0 to 15 and 60 frequency bands ranging from 0 to 59 in fig. 8a. Thus, one slot and one band represent a time/frequency region 802 or 804. However, in order to reduce the side-information bit rate, the high-resolution representation is preferably converted into a low-resolution representation as shown in fig. 8b, in which there is only a single time interval and in which the 60 frequency bands are grouped into 11 parameter bands, as shown at 812 in fig. 8b. Thus, as shown in fig. 10c, the high-resolution representation is indexed by the slot index n and the band index k, and the low-resolution representation by the grouped slot index m and the parameter band index l. In the context of this specification, a time/frequency interval may be a high-resolution time/frequency region 802, 804 of fig. 8a or a low-resolution time/frequency unit identified by the grouped slot index and the parameter band index at the input of block 731c in fig. 10c.
In the embodiment of fig. 6a, the audio renderer 700 is configured to: for each of the one or more frequency bins, a contribution of the one or more transmission channels is calculated from first direction information associated with a first of the at least two related audio objects and from second direction information associated with a second of the at least two related audio objects. In the embodiment shown in fig. 8b, block 814 has an object index for each related object in the parameter band, i.e. has two or more object indices, such that there are two contributions per time-frequency interval.
As will be outlined later with respect to fig. 10a, the calculation of the contribution may be done indirectly via a mixing matrix, wherein the gain value of each relevant object is determined and used to calculate the mixing matrix. Alternatively, as shown in fig. 10b, the contributions may be calculated explicitly again using the gain values, and then the explicitly calculated contributions are summed for each output channel in a certain time/frequency interval. Thus, whether the contribution is explicitly calculated or implicitly calculated, the audio renderer still renders the one or more transmission channels into the plurality of audio channels using the direction information such that, for each of the one or more frequency bins, the contribution of the one or more transmission channels is included in the plurality of audio channels according to the first direction information associated with a first of the at least two related audio objects and according to the second direction information associated with a second of the at least two related audio objects.
Fig. 6b shows a decoder for decoding an encoded audio signal according to the second aspect, the encoded audio signal comprising: direction information of a plurality of audio objects and one or more transmission channels; and parameter data of the audio object for one or more frequency bins of the time frame. Also, the decoder includes an input interface 600 for receiving an encoded audio signal, and the decoder includes an audio renderer 700 for rendering one or more transmission channels into a plurality of audio channels using direction information. Specifically, the audio renderer is configured to calculate the direct response information from one or more audio objects of each of the plurality of frequency bins and direction information associated with one or more related audio objects of the frequency bins. The direct response information preferably comprises gain values for covariance synthesis or advanced covariance synthesis or for explicitly calculating the contribution of one or more transmission channels.
Preferably, the audio renderer is configured to calculate covariance synthesis information using the direct response information of the one or more relevant audio objects in a time/frequency bin and using information on the plurality of audio channels. Furthermore, the covariance synthesis information (which is preferably a mixing matrix) is applied to the one or more transmission channels to obtain the plurality of audio channels. In another implementation, the direct response information is a direct response vector for each of the one or more audio objects, the covariance synthesis information is a covariance synthesis matrix, and the audio renderer is configured to perform a matrix operation per frequency bin when applying the covariance synthesis information.
Furthermore, the audio renderer 700 is configured to derive, when calculating the direct response information, a direct response vector for each of the one or more audio objects, and to calculate, for each of the one or more audio objects, a covariance matrix from the respective direct response vector. Further, when calculating the covariance synthesis information, a target covariance matrix is calculated. However, instead of the target covariance matrix itself, the information relevant for the target covariance matrix may be used, i.e., the direct response matrix or vectors of the one or more most dominant objects, and the diagonal matrix of direct powers, denoted E, determined by applying the power ratios.
Thus, the target covariance information need not necessarily be an explicit target covariance matrix, but is derived from covariance matrices of one audio object or covariance matrices of more audio objects in a time/frequency interval, from power information about respective one or more audio objects in the time/frequency interval, and from power information derived from one or more transmission channels for one or more time/frequency intervals.
The bitstream representation is read by the decoder, and the encoded transmission channels and the encoded parametric side information contained therein are made available for further processing. The parametric side information comprises:
Direction information (for each frame) as quantized azimuth and elevation values
Object indices representing the subset of relevant objects (for each parameter band)
Quantized power ratios (for each parameter band) relating the relevant objects to each other
All processing is done in a frame-by-frame fashion, where each frame contains one or more subframes. For example, a frame may be made up of four subframes, in which case one subframe would have a duration of 5 milliseconds. Fig. 4 shows a simplified overview of a decoder.
Fig. 4 shows an audio decoder implementing the first and second aspects. The input interface 600 shown in figs. 6a and 6b comprises a demultiplexer 602, a core decoder 604, a decoder 608 for decoding the object indices, a decoder 610 for decoding and dequantizing the power ratios, and a decoder 612 for decoding and dequantizing the direction information. In addition, the input interface comprises a filter bank 606 for providing the transmission channels in the form of a time/frequency representation.
The audio renderer 700 includes: a direct response calculator 704; prototype matrix provider 702, controlled by an output configuration received, for example, by a user interface; a covariance synthesis block 706; and a synthesis filter bank 708 to ultimately provide an output audio file comprising the number of audio channels in the channel output format.
Accordingly, items 602, 604, 606, 608, 610, 612 are preferably included in the input interfaces of fig. 6a and 6b, and items 702, 704, 706, 708 of fig. 4 are part of the audio renderer indicated with reference numeral 700 of fig. 6a or 6 b.
The encoded parametric side information is decoded, yielding again the quantized power ratios, the quantized azimuth and elevation values (direction information), and the object indices. The one power ratio that was not transmitted is recovered by exploiting the fact that all power ratios sum to 1. The resolution (l, m) of these parameters corresponds to the grouping of time/frequency regions employed at the encoder side. During the further processing steps, which use a finer time/frequency resolution (k, n), the parameters of a parameter band are valid for all time/frequency regions contained in that parameter band, corresponding to an expansion (l, m) → (k, n).
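The (l, m) → (k, n) expansion can be sketched as follows; band_edges is assumed to be the same edge list as in the encoder-side grouping sketch above:

```python
import numpy as np

def expand_parameters(values_per_band, band_edges, n_slots=16, n_bands=60):
    """Expand one parameter value per parameter band (l, m) to the finer
    (k, n) synthesis grid; each value is valid for all tiles in its band."""
    out = np.zeros((n_slots, n_bands))
    for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        out[:, lo:hi] = values_per_band[b]
    return out
```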
The encoded transmission channels are decoded by the core decoder. Each frame of the decoded audio signal is converted into a time/frequency representation using a filter bank (matched to the filter bank used in the encoder), whose resolution is typically finer than, but at least equal to, the resolution used for the parametric side information.
Output signal rendering/synthesis
The following description applies to one frame of an audio signal; t represents the transpose operator:
Using the decoded transmission channels x = x(k, n) = [X_1(k, n), X_2(k, n)]^T (in this case comprising two transmission channels) in the form of a time/frequency representation, together with the parametric side information, a mixing matrix M is derived for each subframe (or for each frame, to reduce computational complexity) and used to synthesize the time/frequency output signal y = y(k, n) = [Y_1(k, n), Y_2(k, n), Y_3(k, n), ...]^T comprising a plurality of output channels (e.g., 5.1, 7.1, 7.1+4, etc.).
For all (input) objects, so-called direct response values are determined using the transmitted object directions; these describe the panning gains to be used for the output channels. The direct response values are specific to the target layout, i.e., the number and positions of the loudspeakers (provided as part of the output configuration). Examples of panning methods include vector base amplitude panning (VBAP) [Pulkki1997] and edge fading amplitude panning (EFAP) [Borß2014]. Each object has a vector dr_i of direct response values associated with it (containing as many elements as there are loudspeakers). These vectors are calculated once per frame. Note: if the object position coincides with a loudspeaker position, the vector contains a value of 1 for that loudspeaker and 0 everywhere else. If the object is located between two (or three) loudspeakers, the corresponding number of non-zero vector elements is two (or three).
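As an illustration of the behavior described in the note above, the following toy sketch computes such a direct response vector for a 2D loudspeaker ring using simple linear pair-wise panning (an assumption chosen for illustration; the codec uses VBAP or EFAP instead, and the function name is hypothetical):

    import numpy as np

    def direct_response_2d(obj_az, spk_az):
        # Toy direct-response vector: 1.0 at a coinciding speaker, otherwise
        # two non-zero gains for the enclosing speaker pair (linear panning,
        # energy-normalized).
        spk = np.asarray(spk_az, dtype=float)
        dr = np.zeros(len(spk))
        hit = np.isclose((spk - obj_az) % 360.0, 0.0)
        if hit.any():
            dr[hit] = 1.0
            return dr
        rel = (spk - obj_az) % 360.0       # angular distances on the circle
        nb1 = np.argmin(rel)               # nearest speaker in +azimuth direction
        nb2 = np.argmax(rel)               # nearest speaker in -azimuth direction
        span = (spk[nb1] - spk[nb2]) % 360.0
        frac = ((obj_az - spk[nb2]) % 360.0) / span
        g = np.array([1.0 - frac, frac])
        g /= np.linalg.norm(g)             # energy normalization
        dr[nb2], dr[nb1] = g[0], g[1]
        return dr

    # Object at +10 deg between speakers at +30 and -30 deg:
    print(direct_response_2d(10.0, [30.0, -110.0, 110.0, -30.0]))
    # -> exactly two non-zero gains, for the speakers at +30 and -30 deg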
The actual synthesis step (in this example, covariance synthesis [ Vilkamo2013 ]) comprises the following sub-steps (see visualization of fig. 5):
For each parameter band, the object indices describing the subset of dominant objects within the time/frequency regions grouped into that parameter band are used to extract the subset of vectors dr_i needed for further processing. If, for example, only 2 relevant objects are considered, only the 2 vectors dr_i associated with these 2 relevant objects are needed.
From the direct response values dr_i, a covariance matrix C_i of size (output channels × output channels) is then calculated for each relevant object:

C_i = dr_i · dr_i^T
For each time/frequency region (within the parameter band), the audio signal power P(k, n) is determined. In the case of two transmission channels, the signal power of the first channel is added to that of the second channel. Each power ratio is multiplied by this signal power, yielding a direct power value for each relevant/dominant object i:

DP_i(k, n) = PR_i(k, n) · P(k, n)
For each band k, the final target covariance matrix C_Y of size (output channels × output channels) is obtained by summing over all slots n within a (sub)frame and over all relevant objects:

C_Y = Σ_n Σ_i DP_i(k, n) · C_i
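Before continuing, the chain from the direct responses to the final target covariance matrix can be summarized in a few lines of numpy (a sketch with assumed sizes and values; 2 relevant objects):

    import numpy as np

    n_out = 5                                    # output channels
    dr = [np.array([0.9, 0.44, 0.0, 0.0, 0.0]),  # direct responses dr_i of
          np.array([0.0, 0.0, 1.0, 0.0, 0.0])]   # the two relevant objects
    PR = np.array([[0.7, 0.6, 0.8, 0.5],         # power ratios PR_i(k, n)
                   [0.3, 0.4, 0.2, 0.5]])        # per slot n of this band k
    P = np.array([2.0, 1.5, 1.8, 2.2])           # downmix power P(k, n), already
                                                 # summed over the transmission channels
    C_Y = np.zeros((n_out, n_out))
    for i in range(2):
        C_i = np.outer(dr[i], dr[i])             # C_i = dr_i * dr_i^T
        DP_i = PR[i] * P                         # DP_i(k, n) = PR_i(k, n) * P(k, n)
        C_Y += DP_i.sum() * C_i                  # sum over slots n and objects i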
Fig. 5 shows a detailed overview of the covariance synthesis step performed in block 706 of fig. 4. Specifically, the fig. 5 embodiment comprises a signal power calculation block 721, a direct power calculation block 722, a covariance matrix calculation block 723, a target covariance matrix calculation block 724, an input covariance matrix calculation block 726, a mixing matrix calculation block 725, and a rendering block 727. Relative to fig. 4, the rendering block 727 additionally includes the filter bank block 708, so that the output signal of block 727 preferably corresponds to a time domain output signal. When block 708 is not included in the rendering block of fig. 5, the result is instead a spectral domain representation of the corresponding audio channels.
(The following steps are part of the state of the art [Vilkamo2013] and are included for clarity.)
For each (sub)frame and each frequency band, an input covariance matrix C_x = x·x^T of size (transmission channels × transmission channels) is calculated from the decoded audio signal. Alternatively, only the entries of the main diagonal may be used, in which case all other entries are set to zero.
A prototype matrix of size (output channels × transmission channels) is defined, which describes the mapping of transmission channels to output channels (provided as part of the output configuration); the number of output channels is given by the target output format (e.g., the target loudspeaker layout). The prototype matrix may be static or may vary from frame to frame. Examples: if only a single transmission channel is sent, that transmission channel is mapped to each output channel. If two transmission channels are transmitted, the left (first) channel is mapped to all output channels located at positions within (+0°, +180°), i.e., the "left" channels; the right (second) channel is accordingly mapped to all output channels located at positions within (−0°, −180°), i.e., the "right" channels. (Note: 0° describes the position in front of the listener, positive angles describe positions to the left of the listener, and negative angles describe positions to the right of the listener. If a different convention is adopted, the signs of the angles need to be adjusted accordingly.) The input covariance, this prototype mapping, and the final mixing step are illustrated in a sketch following these steps.
Using the input covariance matrix C_x, the target covariance matrix C_Y, and the prototype matrix, a mixing matrix is calculated for each (sub)frame and each frequency band [Vilkamo2013], resulting in, e.g., 60 mixing matrices per (sub)frame.
The mixing matrices are (e.g., linearly) interpolated between (sub)frames, corresponding to temporal smoothing.
Finally, the output channels y are synthesized band by band by multiplying each mixing matrix M of the final set (each of size output channels × transmission channels) with the corresponding frequency band of the time/frequency representation of the decoded transmission channels x:

y = M·x
Note that the residual signal r described in [Vilkamo2013] is not used here.
The output signal y is converted back into a time domain representation y (t) using a filter bank.
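The input covariance, the prototype mapping, and the final per-band mixing with temporal smoothing described in the steps above can be sketched as follows (a numpy illustration; the function names, array shapes, the center-speaker boundary case, and the interpolation scheme are assumptions):

    import numpy as np

    def input_covariance(x, diagonal_only=True):
        # x: (n_tx, n_slots) complex time/frequency samples of one band.
        # The conjugate transpose is used for complex data (the text
        # writes x*x^T for the real-valued notation).
        C_x = x @ x.conj().T
        return np.diag(np.diag(C_x)) if diagonal_only else C_x

    def prototype_matrix(spk_az_deg, n_tx):
        # Maps transmission channels to output channels. For two transmission
        # channels, positive azimuths ("left", per the document's convention)
        # feed the first channel and negative azimuths the second; a center
        # speaker at 0 degrees feeding both is an assumed boundary rule.
        Q = np.zeros((len(spk_az_deg), n_tx))
        for row, az in enumerate(spk_az_deg):
            if n_tx == 1:
                Q[row, 0] = 1.0            # mono: mapped to every output
            else:
                if az >= 0:
                    Q[row, 0] = 1.0        # "left" transmission channel
                if az <= 0:
                    Q[row, 1] = 1.0        # "right" transmission channel
        return Q

    def render_band(M_prev, M_curr, x):
        # Linear interpolation of the mixing matrices over the slots of one
        # (sub)frame (temporal smoothing), then y = M x, slot by slot.
        n_out, n_slots = M_curr.shape[0], x.shape[1]
        y = np.zeros((n_out, n_slots), dtype=x.dtype)
        for n in range(n_slots):
            w = (n + 1) / n_slots
            M = (1.0 - w) * M_prev + w * M_curr
            y[:, n] = M @ x[:, n]
        return y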
Optimized covariance synthesis
Because of the way the input covariance matrix C_x and the target covariance matrix C_Y are calculated in this embodiment, some optimizations of the optimal mixing matrix calculation from [Vilkamo2013] can be applied, resulting in a significant reduction of the computational complexity of the mixing matrix calculation. Note that in this section the Hadamard operator ∘ denotes an element-wise operation on matrices: the operation is not performed on the matrix as a whole (as in matrix multiplication) but on each element separately. For example, the Hadamard product of matrices A and B does not correspond to the matrix product AB = C but to the element-wise operation A_ij · B_ij = C_ij.
SVD(·) denotes the singular value decomposition. The algorithm from [Vilkamo2013], presented there as a Matlab function (Listing 1), is as follows (prior art):

Input: matrix C_x of size m × m, containing the covariance of the input signal
Input: matrix C_Y of size n × n, containing the target covariance of the output signal
Input: matrix Q of size n × m, the prototype matrix
Input: scalar α, regularization factor for S_x ([Vilkamo2013] suggests α = 0.2)
Input: scalar β, regularization factor ([Vilkamo2013] suggests β = 0.001)
Input: boolean a, indicating whether energy compensation should be performed instead of calculating the residual covariance C_r
Output: matrix M of size n × m, the optimal mixing matrix
Output: matrix C_r of size n × n, containing the residual covariance
As described in the previous section, optionally only the main diagonal of C_x is used and all other entries are set to zero. In this case C_x is a diagonal matrix, a valid decomposition satisfying equation (3) of [Vilkamo2013] is

K_x = sqrt(C_x)

(taking the square root of each main diagonal element), and the SVD in line 3 of the prior art algorithm is no longer needed.

Consider the formulas from the previous section used to generate the target covariance from the direct responses dr_i and the direct powers (or direct energies):

C_i = dr_i · dr_i^T
DP_i(k, n) = PR_i(k, n) · P(k, n)
C_Y = Σ_n Σ_i DP_i(k, n) · C_i
The last formula can be rearranged and written as

C_Y = Σ_i dr_i · (Σ_n DP_i(k, n)) · dr_i^T

If we now define

E_i = Σ_n DP_i(k, n)

we thereby obtain

C_Y = Σ_i dr_i · E_i · dr_i^T

It can easily be seen that, if we arrange the direct responses of the k most dominant objects into a matrix R = [dr_1 … dr_k] and create a diagonal matrix E of the direct powers, where E_i,i = E_i, then C_Y can also be expressed as

C_Y = R · E · R^H

and the valid decomposition of C_Y satisfying equation (3) of [Vilkamo2013] is then given by

K_Y = R · sqrt(E)

Therefore, the SVD in line 1 of the prior art algorithm is no longer needed.
This results in the optimized algorithm for covariance synthesis of this embodiment, which also allows the energy compensation option to be used at all times, so that the residual covariance C_r is not required:

Input: diagonal matrix C_x of size m × m, containing the covariance of the input signal with m channels
Input: matrix R of size n × k, containing the direct responses of the k dominant objects
Input: diagonal matrix E, containing the target powers of the dominant objects
Input: matrix Q of size n × m, the prototype matrix
Input: scalar α, regularization factor for S_x ([Vilkamo2013] suggests α = 0.2)
Input: scalar β, regularization factor ([Vilkamo2013] suggests β = 0.001)
Output: matrix M of size n × m, the optimal mixing matrix
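A minimal numpy sketch of this optimized computation is given below. It follows the steps derived above and detailed further below with respect to fig. 11; the function name, the small stabilizing constants, and the exact placement of the regularization are assumptions, with α and β following the values suggested in [Vilkamo2013]:

    import numpy as np

    def optimized_mixing_matrix(Cx_diag, R, E_diag, Q, alpha=0.2, beta=0.001):
        # Cx_diag: (m,) main diagonal of the input covariance matrix C_x
        # R:       (n, k) direct responses of the k dominant objects
        # E_diag:  (k,) target (direct) powers of the dominant objects
        # Q:       (n, m) prototype matrix
        # Decompositions without any SVD: K_x = sqrt(C_x), K_y = R * sqrt(E).
        Kx = np.sqrt(Cx_diag)
        Ky = R * np.sqrt(E_diag)[None, :]
        # Regularized inverse of K_x: limit small diagonal values (factor alpha).
        Kx_reg_inv = 1.0 / np.maximum(Kx, alpha * Kx.max())
        # Normalization G_y: target vs. prototype-rendered channel energies.
        Cy_diag = (np.abs(R) ** 2) @ E_diag            # diag(R E R^H)
        Cy_hat_diag = (np.abs(Q) ** 2) @ Cx_diag       # diag(Q C_x Q^H)
        Gy = np.sqrt(Cy_diag / (Cy_hat_diag + beta * Cy_hat_diag.max() + 1e-12))
        # The single SVD, of the m x k matrix K_x^H Q^H G_y K_y.
        A = (Kx[:, None] * Q.conj().T) @ (Gy[:, None] * Ky)
        U, _, Vh = np.linalg.svd(A, full_matrices=False)
        P = Vh.conj().T @ U.conj().T                   # no extended identity needed
        M_opt = (Ky @ P) * Kx_reg_inv[None, :]         # M_opt = K_y P K_x^{-1}
        # Energy compensation instead of a decorrelated residual signal.
        Cy_tilde_diag = (np.abs(M_opt) ** 2) @ Cx_diag # diag(M C_x M^H)
        G = np.sqrt(Cy_diag / (Cy_tilde_diag + 1e-12))
        return G[:, None] * M_opt                      # final mixing matrix M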
A careful comparison of the existing algorithm with the proposed algorithm shows that the former requires three SVDs, of matrices of size m × m, n × n, and m × n, respectively, where m is the number of downmix channels and n is the number of output channels to which the objects are rendered.
The proposed algorithm requires only one SVD, of a matrix of size m × k, where k is the number of dominant objects. Furthermore, since k is typically much smaller than n, this matrix is smaller than the corresponding matrix of the prior art algorithm.
For an m × n matrix, the complexity of standard SVD implementations is approximately O(c_1·m²·n + c_2·n³) [Golub2013], where c_1 and c_2 are constants depending on the algorithm used. The computational complexity of the proposed algorithm is therefore significantly reduced compared to the prior art algorithm.
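As a worked example with assumed sizes: for m = 2 transmission channels, n = 12 output channels (e.g., 7.1+4) and k = 2 dominant objects, the prior art algorithm requires SVDs of matrices of size 2×2, 12×12 and 2×12 per band and (sub)frame, where the 12×12 SVD alone contributes on the order of c_2·12³ ≈ 1728·c_2 operations; the proposed algorithm requires a single SVD of a 2×2 matrix, on the order of c_1·2²·2 + c_2·2³ = 8·c_1 + 8·c_2 operations. Repeated for, e.g., 60 bands per (sub)frame, this difference dominates the cost of the mixing matrix calculation.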
Subsequently, preferred embodiments relating to the encoder side of the first aspect are discussed in relation to fig. 7a, 7 b. Furthermore, a preferred implementation of the encoder-side implementation of the second aspect is discussed with respect to fig. 9a to 9 d.
Fig. 7a shows a preferred implementation of the object parameter calculator 100 of fig. 1a. In block 120, the audio objects are converted into a spectral representation; this is achieved by the filter bank 102 of fig. 2 or fig. 3. Then, in block 122, the selection information is calculated, for example as shown in block 104 of fig. 2 or fig. 3. For this purpose, amplitude-related measures may be used, such as the amplitude itself, the power, the energy, or any other measure obtained by raising the amplitude to a power different from 1. The result of block 122 is a set of selection information for each object in the corresponding time/frequency interval. Then, in block 124, the object ID or IDs for each time/frequency interval are derived. According to the first aspect, two or more object IDs per time/frequency interval are derived. According to the second aspect, the number of object IDs per time/frequency interval may even be only a single object ID, identifying the most important or strongest or most relevant object according to the information provided by block 122. Block 124 outputs the parameter data comprising the single or several indices of the most relevant object or objects.
Where there are two or more relevant objects per time/frequency interval, the function of block 126 is useful for calculating amplitude-related measures that characterize the objects in the time/frequency interval. The amplitude-related measure may be the same as the one already calculated for the selection information in block 122; preferably, however, one or more combined values are calculated using the information already computed by block 122, as indicated by the dashed line between block 122 and block 126. The amplitude-related measures or the combined value(s) calculated in block 126 are forwarded to the quantizer and encoder block 212, so that the encoded amplitude-related values or encoded combined values are included as additional parametric side information. In the embodiment of fig. 2 or 3, these values are the "encoded power ratios" included in the bitstream together with the "encoded object indices". Where each time/frequency interval has only a single object ID, neither power ratio calculation nor quantization and encoding is necessary; the index of the most relevant object in the time/frequency interval is sufficient for decoder-side rendering.
Fig. 7b shows a preferred implementation of the calculation of the selection information in block 122 of fig. 7a. As shown in block 123, the signal power is calculated as the selection information for each object and each time/frequency interval. Then, in block 125, which illustrates a preferred implementation of block 124 of fig. 7a, the object IDs of the single or, preferably, two or more objects with the highest power are extracted and output. Furthermore, in the case of two or more relevant objects, the power ratios are calculated as shown in block 127, a preferred implementation of block 126: for each extracted object ID, the power ratio relates the power of that object to the power of all extracted objects found by block 125. This procedure is advantageous because only one combined value fewer than the number of relevant objects per time/frequency interval has to be transmitted, since in this embodiment the decoder knows the rule that the power ratios of all objects must sum to 1. Preferably, the functions of blocks 120, 122, 124, 126 of fig. 7a and/or 123, 125, 127 of fig. 7b are implemented by the object parameter calculator 100 of fig. 1a, and the function of block 212 of fig. 7a is implemented by the output interface 200 of fig. 1a.
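As an illustration of blocks 123, 125 and 127 (a numpy sketch; the function name and the example powers are assumptions):

    import numpy as np

    def select_objects_and_ratios(obj_power, n_rel=2):
        # Pick the n_rel object IDs with the highest signal power in this
        # time/frequency interval and compute their power ratios (each power
        # divided by the sum of the selected powers).
        ids = np.argsort(obj_power)[::-1][:n_rel]
        ratios = obj_power[ids] / obj_power[ids].sum()
        # Only n_rel - 1 ratios are transmitted; the decoder restores the
        # last one from the sum-to-one rule.
        return ids, ratios[:-1]

    ids, tx_ratios = select_objects_and_ratios(np.array([0.1, 4.0, 0.5, 1.0]))
    # ids -> [1, 3], tx_ratios -> [0.8]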
The apparatus for encoding according to the second aspect shown in fig. 1b is subsequently described in more detail with respect to several embodiments. In step 110a, the direction information is either extracted from the input signal, for example as shown in fig. 12a, or obtained by reading or parsing the metadata information included in a metadata portion or metadata file. In step 110b, the direction information of each frame and audio object is quantized, and the quantization index per object and frame is forwarded to an encoder or output interface, such as the output interface 200 of fig. 1b. In step 110c, the direction index is dequantized to obtain dequantized values, which in some implementations may also be output directly by block 110b. Based on the dequantized direction index, block 422 then calculates weights for each transmission channel and each object based on a certain virtual microphone setting. The virtual microphone setting may comprise two virtual microphone signals from microphones arranged at the same position but with different orientations, or arranged at two different positions relative to a reference position or orientation (e.g., a virtual listener position or orientation). A setting with two virtual microphone signals results in weights for two transmission channels per object.
In the case of three transmission channels, the virtual microphone setting may comprise three virtual microphone signals from microphones at the same position with different orientations, or from microphones arranged at three different positions relative to a reference position or orientation, which may be a virtual listener position or orientation.

Alternatively, four transmission channels may be generated based on a virtual microphone setting with four virtual microphone signals from microphones arranged at the same position with different orientations, or at four different positions relative to a reference position or orientation, which may be a virtual listener position or orientation.
Furthermore, for calculating the weights of each object for each transmission channel (taking two channels, with weights w_L and w_R, as an example), the virtual microphone signals may be signals derived from a virtual first-order microphone, a virtual cardioid microphone, a virtual figure-of-eight (dipole or bidirectional) microphone, a virtual directional microphone, a virtual subcardioid microphone, a virtual unidirectional microphone, a virtual supercardioid microphone, or a virtual omnidirectional microphone.
In this context, it should be noted that no placement of the actual microphones is required in order to calculate the weights. Instead, the rules for calculating weights vary depending on the virtual microphone settings (i.e., placement of the virtual microphones and characteristics of the virtual microphones).
In block 404 of fig. 9a, the weights are applied to the objects, so that, for each object, a contribution of that object to a particular transmission channel is obtained whenever its weight differs from 0. Block 404 therefore receives the object signals as input. Then, in block 406, the contributions are summed per transmission channel: for example, the contributions of the objects to the first transmission channel are added together, the contributions of the objects to the second transmission channel are added together, and so on. The output of block 406 is then the transmission channels, for example in the time domain.
Preferably, the object signal input into block 404 is a time domain object signal with full band information, and the applying in block 404 and summing in block 406 are performed in the time domain. However, in other embodiments, these steps may also be performed in the spectral domain.
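A time domain sketch of this weight-apply-and-sum procedure follows (the cardioid-style weighting for two virtual microphones facing ±90° is an assumption chosen for illustration; the actual weight rule depends on the virtual microphone setting, and the function name is hypothetical):

    import numpy as np

    def downmix_stereo(objs, azimuths_deg):
        # objs: (n_objects, n_samples) time domain object signals.
        # Weights per object from a cardioid pair facing +90/-90 degrees
        # (assumed rule); then weight and sum, sample by sample.
        az = np.radians(np.asarray(azimuths_deg, dtype=float))
        w_L = 0.5 * (1.0 + np.sin(az))   # virtual microphone facing left
        w_R = 0.5 * (1.0 - np.sin(az))   # virtual microphone facing right
        tx_L = (w_L[:, None] * objs).sum(axis=0)
        tx_R = (w_R[:, None] * objs).sum(axis=0)
        return np.stack([tx_L, tx_R])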
Fig. 9b shows another embodiment, implementing a static downmix. To this end, the direction information of the first frame is extracted in block 130, and the weights are calculated from this first frame, as shown in block 403a. Then, for the other frames indicated in block 408, the weights are kept unchanged to obtain the static downmix.
Fig. 9c shows another implementation, calculating a dynamic downmix. To this end, block 132 extracts the direction information for each frame, and the weights are updated for each frame, as shown in block 403b. Then, in block 405, the updated weights are applied to the corresponding frames to obtain a downmix that is dynamic from frame to frame. Other implementations between these extremes of fig. 9b and 9c are also useful, where, for example, the weights are updated only every second, third or, generally, every n-th frame, and/or a smoothing of the weights over time is performed, so that the directional characteristics used for the direction-dependent downmix do not change too much over time. Fig. 9d shows another implementation of the downmixer 400 controlled by the object direction information provider 110 of fig. 1b. In block 410, the downmixer analyzes the direction information of all objects in a frame, and in block 412, the virtual microphones used for calculating the weights (w_L and w_R in the stereo example) are placed in accordance with the analysis result, where the placement of a microphone refers to the microphone position and/or the microphone orientation. In block 414, the microphone placement is either kept for the other frames, similar to the static downmix discussed with respect to block 408 of fig. 9b, or updated as discussed with respect to block 405 of fig. 9c, in order to obtain the functionality of block 414 of fig. 9d. Regarding the function of block 412, the microphones may be placed such that a good separation is obtained, i.e., a first virtual microphone "sees" a first set of objects and a second virtual microphone "sees" a second, different set of objects, preferably such that any object of one set is, as far as possible, not included in the other set. Alternatively, the analysis of block 410 may be enhanced by other parameters, and the placement may also be controlled by other parameters.
Subsequently, preferred implementations of the decoder according to the first or second aspect and discussed in relation to e.g. fig. 6a and 6b are given by fig. 10a, 10b, 10c, 10d and 11 below.
In block 613, the input interface 600 is configured to obtain the individual object direction information associated with the object IDs. This procedure corresponds to the function of block 612 of fig. 4 or fig. 5 and results in a "codebook of frames" as shown and discussed with respect to fig. 8b and, in particular, fig. 8c.
Further, in block 609, one or more object IDs per time/frequency interval are obtained, regardless of whether the data is available for low-resolution parameter bands or high-resolution frequency regions. The result of block 609, which corresponds to the procedure of block 608 in fig. 4, is the specific ID or IDs of the one or more relevant objects in the time/frequency interval. Then, in block 611, the specific object direction information for the specific ID or IDs is obtained for each time/frequency interval from the "codebook of frames" (i.e., from the exemplary table shown in fig. 8c). Then, in block 704, gain values of the one or more relevant objects for the respective output channels, controlled by the output format, are calculated for each time/frequency interval. Then, in block 730 or in blocks 706, 708, the output channels are calculated. The output channels may be calculated via an explicit calculation of the contributions to the one or more transmission channels, as shown in fig. 10b, or via an indirect calculation and use of these contributions, as shown in fig. 10d or fig. 11. Fig. 10b shows the function of acquiring the power values or power ratios in block 610, corresponding to the function of fig. 4. These power values are then applied to the respective transmission channels for each relevant object, as indicated by blocks 733 and 735, in addition to the gain values determined by block 704, so that blocks 733, 735 produce object-specific contributions of the transmission channels (e.g., transmission channels ch1, ch2, …). These explicitly calculated transmission channel contributions are then added together per output channel for each time/frequency interval in block 737.
Then, depending on the implementation, a diffuse signal calculator 741 may be provided, which generates a diffuse signal for each output channel ch1, ch2, … in the corresponding time/frequency interval; in block 739, the diffuse signal and the contribution result of block 737 are combined, so that a complete channel contribution is obtained in each time/frequency interval. When the covariance synthesis additionally relies on a diffuse signal, this signal corresponds to the input of the filter bank 708 of fig. 4. However, when the covariance synthesis 706 does not rely on a diffuse signal but only on processing without any decorrelators, the energy of the output signal, at least for each time/frequency interval, corresponds to the energy of the channel contribution at the output of block 739 of fig. 10b. Without the diffuse signal calculator 741, the result of block 739 corresponds to the result of block 706, i.e., a complete channel contribution in each time/frequency interval, which can be converted separately for each output channel ch1, ch2, in order to finally obtain an output audio file with time domain output channels, which can be stored or forwarded to loudspeakers or any type of rendering device.
Fig. 10c shows a preferred implementation of the functionality of block 610 of fig. 10b or fig. 4. In step 610a, the combined (power) value or values are obtained for a certain time/frequency interval. In block 610b, the corresponding other value of the remaining relevant object in the time/frequency interval is calculated, based on the rule that all combined values must sum to 1.
The result will then preferably be a low-resolution representation with two power ratios for each grouped slot index and each parameter band index. These power ratios have low time/frequency resolution. In block 610c, the time/frequency resolution may be expanded to a high time/frequency resolution, yielding power values for the time/frequency regions of the high-resolution slot index n and the high-resolution band index k. The expansion may consist of directly using one and the same low-resolution value for all time slots within a grouped slot and all frequency bands within a parameter band.
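This expansion (l, m) → (k, n) can be sketched in two lines of numpy (the band and slot grouping factors are assumed for illustration):

    import numpy as np

    # Power ratios at parameter resolution (l, m):
    # 2 parameter bands x 2 grouped slots.
    pr_low = np.array([[0.7, 0.6],
                       [0.3, 0.4]])
    # Every high-resolution band/slot within a parameter band / grouped slot
    # reuses the same value (4 bands per parameter band, 2 slots per grouped
    # slot, assumed here).
    pr_high = np.repeat(np.repeat(pr_low, 4, axis=0), 2, axis=1)
    # pr_high.shape -> (8, 4)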
Fig. 10d shows a preferred implementation of the calculation of the covariance synthesis information in block 706 of fig. 4, represented by a mixing matrix 725 for mixing the two or more input transmission channels into the two or more output signals. Thus, when there are, for example, two transmission channels and six output channels, the mixing matrix for each individual time/frequency interval has six rows and two columns. In block 723, corresponding to the function of block 723 in fig. 5, the gain values or direct response values of each object in each time/frequency interval are received and the covariance matrices are calculated. In block 722, the power values or power ratios are received and the direct power value of each object in the time/frequency interval is calculated; block 722 of fig. 10d corresponds to block 722 of fig. 5.
The results of blocks 723 and 722 are input into the target covariance matrix calculator 724. Additionally or alternatively, an explicit calculation of the target covariance matrix C_Y is not necessary: the relevant information contained in the target covariance matrix, i.e., the direct response information indicated by matrix R and the direct power values indicated by matrix E for the two or more relevant objects, is input into block 725a for calculating the mixing matrix of each time/frequency interval. Furthermore, the mixing matrix calculation block 725a receives the input covariance matrix C_x, derived in block 726 (corresponding to block 726 of fig. 5) from the two or more transmission channels, and the prototype matrix Q. The mixing matrix of each time/frequency interval and frame may be smoothed over time, as shown in block 725b, and in block 727, corresponding to at least a part of the rendering block of fig. 5, the mixing matrix is applied, in non-smoothed or smoothed form, to the transmission channels in the corresponding time/frequency interval, to obtain the complete channel contributions in the time/frequency interval, substantially similar to the corresponding complete contributions at the output of block 739 discussed earlier with respect to fig. 10b. Thus, fig. 10b shows an implementation with an explicit calculation of the transmission channel contributions, while fig. 10d shows an implicit calculation of the transmission channel contribution of each time/frequency interval and each relevant object, either via the target covariance matrix C_Y or via the relevant information R and E introduced from blocks 723 and 722 directly into the mixing matrix calculation block 725a.
Subsequently, the preferred optimized algorithm for covariance synthesis is illustrated with respect to fig. 11. To summarize, all steps shown in fig. 11 are calculated within the covariance synthesis 706 of fig. 4, within the mixing matrix calculation block 725 of fig. 5, or within block 725a of fig. 10d. In step 751, the first decomposition result K_Y is calculated. Since, as shown in fig. 10d, the gain value information included in matrix R and the direct power information of the two or more relevant objects included in matrix E are used directly, the covariance matrix does not need to be calculated explicitly, and the decomposition result can easily be calculated. Thus, the first decomposition result of block 751 can be calculated directly and with little effort, since an explicit singular value decomposition is no longer required.
In step 752, the second decomposition result K_x is calculated. Since the input covariance matrix is treated as a diagonal matrix, ignoring the off-diagonal elements, this decomposition result can also be calculated without an explicit singular value decomposition.
Then, in step 753, a first regularization result is calculated based on the first regularization parameter α, and in step 754, a second regularization result is calculated based on the second regularization parameter β. Since K_x is, in the preferred implementation, a diagonal matrix, the calculation of the first regularization result in step 753 is simplified relative to the prior art, because obtaining S_x is merely a parameter renaming and not a decomposition as in the prior art.
Furthermore, for the calculation of the second regularization result in block 754, the first step is likewise merely a parameter renaming, instead of, as in the prior art, a multiplication with the matrix U_x^H.
Further, in step 755, the normalization matrix G_y is calculated, and, based on step 755, the unitary matrix P is calculated in step 756 from K_x, the prototype matrix Q, and the K_Y obtained in block 751. Since the matrix Λ is not needed here, the calculation of the unitary matrix P is simplified relative to the prior art.
Then, in step 757, the mixing matrix without energy compensation, i.e., M_opt, is calculated, using for this purpose the unitary matrix P, the result of block 754 and the result of block 751. Then, in block 758, energy compensation is performed using the compensation matrix G. The energy compensation is performed so that no residual signal derived from a decorrelator is needed. Instead of performing energy compensation, one could add a residual signal of sufficient energy to fill the energy gap left by mixing with the matrix M_opt. However, for the purposes of the present invention, decorrelated signals are not relied upon, in order to avoid any artifacts introduced by a decorrelator; hence, the energy compensation shown in step 758 is preferred.
Thus, the optimized algorithm for covariance synthesis provides advantages in the calculation of the unitary matrix P through steps 751, 752, 753, 754 and step 756. It is emphasized that the optimized algorithm provides advantages over the prior art even when only one of the steps 751, 752, 753, 754, 756, or only a subset of these steps, is implemented as shown while the corresponding other steps are implemented as in the prior art, because these improvements do not depend on each other and can be applied independently. Nevertheless, the more of these improvements are implemented, the larger the complexity reduction. Hence, a full implementation of the fig. 11 embodiment is preferred, as it provides the highest complexity reduction; but even if only one of the steps 751, 752, 753, 754, 756 is implemented according to the optimized algorithm and the other steps as in the prior art, a complexity reduction is obtained without any quality degradation.
Embodiments of the present invention may also be considered as the following procedure: comfort noise for a stereo signal is generated by mixing three Gaussian noise sources — one noise source per channel and a third, common noise source for creating correlated background noise — or, additionally or separately, the mixing of the noise sources is controlled with a coherence value transmitted with the SID frames.
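A minimal sketch of such a coherence-controlled noise mixing follows (the mixing gains are an assumption, chosen here so that the inter-channel correlation equals the transmitted coherence value and each channel has unit variance):

    import numpy as np

    def stereo_comfort_noise(n_samples, coherence, rng=None):
        # Mix one Gaussian source per channel with a common third source;
        # gains chosen so that E[left*right] = coherence (illustrative rule).
        rng = rng or np.random.default_rng()
        common = rng.standard_normal(n_samples)
        g_c, g_i = np.sqrt(coherence), np.sqrt(1.0 - coherence)
        left = g_c * common + g_i * rng.standard_normal(n_samples)
        right = g_c * common + g_i * rng.standard_normal(n_samples)
        return left, right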
It is to be noted here that all alternatives or aspects discussed above and below, and all aspects defined by the independent claims in the appended claims, may be used individually, i.e., without any alternative, objective, or independent claim other than the contemplated one. However, in other embodiments, two or more alternatives or aspects or independent claims may be combined with each other, and in other embodiments, all aspects or alternatives and all independent claims may be combined with each other.
The encoded signals of the present invention may be stored on a digital storage medium or a non-transitory storage medium, or may be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the internet.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of method steps also represent a description of a corresponding block or item or of a feature of a corresponding apparatus.
Embodiments of the invention may be implemented in hardware or in software, depending on certain implementation requirements. Implementations may be performed using a digital storage medium (e.g., floppy disk, DVD, CD, ROM, PROM, EPROM, EEPROM, or flash memory) having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the corresponding method is performed.
Some embodiments according to the invention comprise a data carrier with electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
In general, embodiments of the invention may be implemented as a computer program product having a program code operable to perform one of the methods when the computer program product is run on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments include a computer program stored on a machine readable carrier or non-transitory storage medium for performing one of the methods described herein.
In other words, an embodiment of the method of the invention is thus a computer program with a program code for performing one of the methods described herein when the computer program runs on a computer.
Thus, other embodiments of the method of the present invention are a data carrier or digital storage medium or computer readable medium having a computer program recorded thereon for performing one of the methods described herein.
Thus, other embodiments of the methods of the present invention are data streams or signal sequences representing a computer program for performing one of the methods described herein. The data stream or signal sequence may, for example, be configured to be transmitted via a data communication connection (e.g., via the internet).
Another embodiment includes a processing device, such as a computer or programmable logic device, configured or adapted to perform one of the methods described herein.
Another embodiment includes a computer having a computer program installed thereon for performing one of the methods described herein.
In some embodiments, a programmable logic device (e.g., a field programmable gate array) may be used to perform some or all of the functions of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor to perform one of the methods described herein. In general, the method is preferably performed by any hardware device.
The above-described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and details described herein will be apparent to others skilled in the art. It is therefore intended that the invention be limited only by the scope of the appended patent claims and not by the specific details presented by way of the description and explanation of the embodiments herein.
Aspects (used independently of each other, or with all or only a subset of other aspects)
An apparatus, method or computer program comprises one or more of the following features:
inventive examples on novel aspects:
multi-wave idea combined with object coding (more than one directional hint is used per T/F region)
Object coding method that is as close as possible to the DirAC paradigm to allow any kind of input type in IVAS (object content not currently covered)
Examples of inventions regarding parameterization (encoder):
for each T/F region: selection information of the n most relevant objects in the T/F region, plus the power ratios between the n most relevant object contributions
For each frame, for each object: one direction
Inventive examples on rendering (decoder):
obtaining a direct response value for each related object from the transmitted object index and direction information and the target output layout
Obtaining covariance matrix from direct response
Calculating the direct powers from the downmix signal power and the transmitted power ratios of each relevant object
Obtaining a final target covariance matrix from the direct power and covariance matrices
Using only the diagonal elements of the input covariance matrix
Optimized covariance synthesis
Some notes about differences from SAOC:
consider n primary objects instead of all objects
The power ratio is thus related to OLD, but calculated differently
In SAOC, direction is introduced only at the decoder (rendering matrix), without using direction information at the encoder
→ The SAOC-3D decoder receives object metadata for the rendering matrix
SAOC employs a downmix matrix and transmits a downmix gain
Embodiments of the invention do not consider diffuseness
Subsequently, other examples of the present invention are summarized.
1. An apparatus for encoding a plurality of audio objects, comprising:
an object parameter calculator (100) configured to: calculating parameter data of at least two related audio objects for one or more of a plurality of frequency bins related to a time frame, wherein the number of the at least two related audio objects is lower than the total number of the plurality of audio objects, and
an output interface (200) configured to output an encoded audio signal comprising information about parametric data of the at least two related audio objects of the one or more frequency bins.
2. The apparatus of example 1, wherein the object parameter calculator (100) is configured to:
converting (120) each of the plurality of audio objects into a spectral representation having a plurality of frequency bins,
calculating (122) selection information for each audio object of the one or more frequency bins, and
deriving (124) an object identification as parameter data indicative of the at least two related audio objects based on the selection information, and
wherein the output interface (200) is configured to introduce information about the object identification into the encoded audio signal.
3. The apparatus of example 1 or 2, wherein the object parameter calculator (100) is configured to: quantizing and encoding (212) one or more amplitude-related measures of related audio objects in the one or more frequency bins or one or more combined values derived from the amplitude-related measures as the parameter data, and
wherein the output interface (200) is configured to introduce quantized one or more amplitude-related measurement values or quantized one or more combined values into the encoded audio signal.
4. According to the apparatus of example 2 or 3,
wherein the selection information is an amplitude related measurement value of the audio object such as an amplitude value, a power value or a loudness value, or an amplitude raised to a power different from 1, and
wherein the object parameter calculator (100) is configured to calculate (127) a combined value, such as a ratio of an amplitude-related measure of the related audio object to a sum of two or more amplitude-related measures of the related audio object, and
wherein the output interface (200) is configured to: introducing information about the combined value into the encoded audio signal, wherein a number of information items about the combined value in the encoded audio signal is at least equal to 1 and less than a number of associated audio objects of the one or more frequency bins.
5. According to the apparatus of one of examples 2 to 4,
wherein the object parameter calculator (100) is configured to select the object identification based on an order of selection information of the plurality of audio objects in the one or more frequency bins.
6. The apparatus of one of examples 2 to 5, wherein the object parameter calculator (100) is configured to:
calculating (122) signal power as the selection information,
deriving (124), for each frequency bin, object identifications of the two or more audio objects having the largest signal power values in the corresponding one or more frequency bins,
calculating (126), as the parameter data, a power ratio relating the signal power of each of the audio objects having the derived object identifications to the sum of the signal powers of the two or more audio objects having the largest signal power values, and
quantizing and encoding (212) the power ratio, and
wherein the output interface (200) is configured to introduce a quantized and encoded power ratio into the encoded audio signal.
7. The apparatus of one of examples 1 to 6, wherein the output interface (200) is configured to introduce the following into the encoded audio signal:
One or more of the encoded transmission channels,
two or more encoded object identifications of the relevant audio objects for each of one or more of the plurality of frequency bins in the time frame, and one or more encoded combined values or encoded amplitude-related measurement values, as the parameter data, and
quantized and encoded direction data for each audio object in the time frame, the direction data being constant for all of the one or more frequency bins.
8. The apparatus of one of examples 1 to 7, wherein the object parameter calculator (100) is configured to: calculating parameter data of at least a most significant object and a second most significant object in the one or more frequency bins, or
Wherein the number of audio objects in the plurality of audio objects is three or more, the plurality of audio objects including a first audio object, a second audio object, and a third audio object, and
wherein the object parameter calculator (100) is configured to: for a first frequency interval of the one or more frequency intervals, computing only a first set of audio objects, such as the first audio object and the second audio object, as the related audio objects; and for a second frequency interval of the one or more frequency intervals, computing only a second set of audio objects, such as the second audio object and the third audio object or the first audio object and the third audio object, as the related audio objects, wherein the first set of audio objects differs from the second set of audio objects in at least one group member.
9. The apparatus according to one of examples 1 to 8, wherein the object parameter calculator (100) is configured to:
calculating raw parametric data having a first time or frequency resolution and combining the raw parametric data into combined parametric data having a second time or frequency resolution lower than the first time or frequency resolution, and calculating parametric data of the at least two related audio objects with respect to the combined parametric data having the second time or frequency resolution, or
A parameter band having a second time or frequency resolution different from the first time or frequency resolution used in the time or frequency decomposition of the plurality of audio objects is determined, and parameter data of the at least two related audio objects is calculated for the parameter band having the second time or frequency resolution.
10. The apparatus of one of the preceding examples, wherein the plurality of audio objects comprises related metadata indicating directional information (810) about the plurality of audio objects, and
wherein the apparatus further comprises:
-a downmixer (400) for downmixing the plurality of audio objects to obtain one or more transmission channels, wherein the downmixer (400) is configured to: downmixing the plurality of audio objects in response to directional information about the plurality of audio objects; and
A transmission channel encoder (300) for encoding one or more transmission channels to obtain one or more encoded transmission channels; and
wherein the output interface (200) is configured to: the one or more transmission channels are introduced into the encoded audio signal.
11. The apparatus of example 10, wherein the downmixer (400) is configured to:
generating two transmission channels as two virtual microphone signals arranged at the same location and having different orientations, or arranged at two different locations relative to a reference location or orientation, such as a virtual listener location or orientation, or
Generating three transmission channels as three virtual microphone signals arranged at the same location and having different orientations, or arranged at three different locations relative to a reference location or orientation, such as a virtual listener location or orientation, or
Generating four transmission channels as four virtual microphone signals arranged at the same location and having different orientations, or arranged at four different locations relative to a reference location or orientation such as a virtual listener location or orientation, or
wherein the virtual microphone signals are virtual first-order microphone signals, virtual cardioid microphone signals, virtual figure-of-eight (dipole or bidirectional) microphone signals, virtual subcardioid microphone signals, virtual unidirectional microphone signals, virtual supercardioid microphone signals, or virtual omnidirectional microphone signals.
12. The apparatus of example 10 or 11, wherein the downmixer (400) is configured to:
for each of the plurality of audio objects, deriving (402) weighting information for each transmission channel using direction information of the corresponding audio object;
weighting (404) the corresponding audio object for a particular transmission channel using weighting information for the audio object to obtain an object contribution for the particular transmission channel, and
-combining (406) object contributions of the plurality of audio objects to the specific transmission channel to obtain the specific transmission channel.
13. According to one of examples 10 to 12,
wherein the downmixer (400) is configured to: calculating the one or more transmission channels as one or more virtual microphone signals, the one or more virtual microphone signals being arranged at the same location and having different orientations, or being arranged at different locations relative to a reference location or orientation, such as a virtual listener location or orientation, the direction information being related to the reference location or orientation,
wherein the different positions or orientations are on or to the left of a centerline and on or to the right of the centerline, or wherein the different positions or orientations are evenly or unevenly distributed over horizontal positions or orientations, e.g., +90 degrees or −90 degrees with respect to the centerline, or −120 degrees, 0 degrees and +120 degrees with respect to the centerline, or wherein the different positions or orientations comprise at least one position or orientation pointing upwards or downwards with respect to a horizontal plane in which a virtual listener is located, wherein the direction information about the plurality of audio objects is related to the virtual listener position or reference position or orientation.
14. The apparatus of one of examples 10 to 13, further comprising:
a parameter processor (110) for quantizing metadata indicating direction information about the plurality of audio objects to obtain quantized direction terms for the plurality of audio objects,
wherein the downmixer (400) is configured to: operating in response to a quantized direction item as the direction information, and
wherein the output interface (200) is configured to: information about the quantization direction term is introduced into the encoded audio signal.
15. According to one of examples 10 to 14,
wherein the downmixer (400) is configured to: an analysis is performed (410) on the directional information about the plurality of audio objects and one or more virtual microphones are placed (412) according to a result of the analysis to generate a transmission channel.
16. According to one of examples 10 to 15,
wherein the downmixer (400) is configured to: downmixing (408) using a downmixing rule that is static over a plurality of time frames, or
Wherein the direction information is variable over a plurality of time frames, and wherein the downmixer (400) is configured to: the downmixing is performed using a downmix rule that is variable over the plurality of time frames (405).
17. The apparatus of one of examples 10 to 16, the downmixer (400) configured to: the samples of the plurality of audio objects are downmixed in the time domain using a sample-by-sample weighted sum combination.
18. A decoder for decoding an encoded audio signal, the encoded audio signal comprising: one or more transmission channels and direction information of a plurality of audio objects, and parameter data of at least two related audio objects for one or more frequency bins of a time frame, wherein the number of the at least two related audio objects is lower than the total number of the plurality of audio objects, the decoder comprising:
-an input interface (600) for providing said one or more transmission channels in the form of a spectral representation, said spectral representation having a plurality of frequency bins in said time frame; and
an audio renderer (700) for rendering the one or more transmission channels into a plurality of audio channels using the direction information such that contributions from the one or more transmission channels are taken into account according to a first direction information associated with a first related audio object of the at least two related audio objects and according to a second direction information associated with a second related audio object of the at least two related audio objects, or
Wherein the audio renderer (700) is configured to: for each of the one or more frequency bins, calculating a contribution from the one or more transmission channels from first direction information associated with a first of the at least two related audio objects and from second direction information associated with a second of the at least two related audio objects.
19. According to the decoder of example 18,
wherein the audio renderer (700) is configured to: for the one or more frequency bins, direction information of audio objects different from the at least two related audio objects is ignored.
20. The decoder of examples 18 or 19,
wherein the encoded audio signal comprises an amplitude related measurement (812) of each related audio object in the parameter data or a combined value (812) related to at least two related audio objects, and
wherein the audio renderer (700) is configured to: a quantitative contribution of the one or more transmission channels is determined (704) from the amplitude-dependent measurement values or the combined values.
21. The decoder of example 20, wherein the encoded signal includes the combined value in the parameter data, and
wherein the audio renderer (700) is configured to: determining (704, 733) a contribution of the one or more transmission channels using the combined value of one of the related audio objects and the directional information of the one related audio object, and
wherein the audio renderer (700) is configured to: the contribution of the one or more transmission channels is determined (704, 735) using values derived from combined values of another of the related audio objects in the one or more frequency bins and the direction information of the other related audio object.
22. The decoder according to one of examples 18 to 21, wherein the audio renderer (700) is configured to:
Direct response information is calculated (704) from the associated audio object for each of the plurality of frequency bins and the direction information associated with the associated audio object in the frequency bin.
23. According to the decoder of example 22,
wherein the audio renderer (700) is configured to: determining (741) a diffuse signal for each of the plurality of frequency bins using diffuseness information, such as diffuseness parameters, or a decorrelation rule, included in the metadata, and combining a direct response determined by the direct response information and the diffuse signal to obtain a spectral domain rendering signal of a channel of the plurality of channels, or
-calculating (706) synthesis information using the direct response information (704) and information (702) about the plurality of audio channels, and-applying (727) covariance synthesis information to the one or more transmission channels to obtain the plurality of audio channels, or
Wherein the direct response information (704) is a direct response vector for each related audio object, and wherein the covariance synthesis information is a covariance synthesis matrix, and wherein the audio renderer (700) is configured to: a matrix operation is performed for each frequency bin when applying (727) the covariance synthesis information.
24. The decoder of example 22 or 23, wherein the audio renderer (700) is configured to:
in calculating the direct response information (704), deriving a direct response vector for each associated audio object; and for each associated audio object, calculating a covariance matrix from each direct response vector,
in calculating the covariance synthesis information, target covariance information is derived (724) from: a covariance matrix from each of the correlated audio objects, power information about the respective correlated audio object, and power information derived from the one or more transmission channels.
25. The decoder of example 24, wherein the audio renderer (700) is configured to:
in calculating the direct response information (704), deriving a direct response vector for each associated audio object; and for each associated audio object, calculating (723) a covariance matrix from each direct response vector,
deriving (726) input covariance information from the transmission channel, and
deriving (725 a,725 b) mixing information from the target covariance information, the input covariance information, and information about the plurality of channels, and
-applying (727) the mixing information to a transmission channel of each frequency interval in the time frame.
26. The decoder of example 25, wherein the result of applying the mixing information for each frequency interval in the time frame is converted (708) into a time domain to obtain a plurality of audio channels in the time domain.
27. The decoder according to one of examples 22 to 26, wherein the audio renderer (700) is configured to:
in decomposing (752) an input covariance matrix derived from the transmission channels, using only the main diagonal elements of the input covariance matrix, or
performing a decomposition (751) of a target covariance matrix using a direct response matrix and a power matrix of the objects or transmission channels, or
performing (752) a decomposition of the input covariance matrix by taking the square root of each main diagonal element of the input covariance matrix, or
calculating (753) a regularized inverse of the decomposed input covariance matrix, or
performing (756) a singular value decomposition, in computing the optimal matrix to be used for energy compensation, without extending the identity matrix.
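For orientation, the covariance synthesis steps of examples 24 to 27 can be pictured for a single frequency bin as follows. This is a minimal numpy sketch in the spirit of the optimized covariance domain framework of Vilkamo et al. (2013), assuming a given prototype up-mix matrix Q and the diagonal-only input covariance decomposition of example 27; all names and shapes are illustrative assumptions, not the codec's actual implementation.

```python
import numpy as np

def cov_synthesis_mixing_matrix(R, p_obj, cx_diag, Q, eps=1e-9):
    # R: (n_out, n_obj) direct-response (panning) matrix of the related objects
    # p_obj: (n_obj,) per-object power estimates for this frequency bin
    # cx_diag: (n_in,) main diagonal of the transmission-channel covariance
    # Q: (n_out, n_in) prototype up-mix matrix (an assumption of this sketch)
    Ky = R * np.sqrt(p_obj)[None, :]       # factor of the target covariance Cy = Ky Ky^H
    kx = np.sqrt(cx_diag)                  # input covariance decomposed from its main diagonal only
    kx_inv = kx / (kx * kx + eps)          # regularized inverse of the decomposed input covariance
    A = (kx[:, None] * Q.conj().T) @ Ky    # Kx^H Q^H Ky
    U, _, Vh = np.linalg.svd(A, full_matrices=False)  # reduced SVD, no extended identity matrix
    P = Vh.conj().T @ U.conj().T           # optimal rotation
    return (Ky @ P) * kx_inv[None, :]      # mixing matrix M; per bin, y = M @ x
```

The per-bin matrices vary with the transmitted directions and power ratios, so M is recomputed for every frequency bin and time frame; the residual energy compensation of the full method is omitted here.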
28. A method of encoding a plurality of audio objects and associated metadata indicative of directional information about the plurality of audio objects, comprising:
Downmixing the plurality of audio objects to obtain one or more transmission channels;
encoding the one or more transmission channels to obtain one or more encoded transmission channels; and
outputting an encoded audio signal comprising the one or more encoded transmission channels,
wherein the downmixing comprises downmixing the plurality of audio objects in response to direction information about the plurality of audio objects.
29. A method of decoding an encoded audio signal, the encoded audio signal comprising: one or more transmission channels and direction information of a plurality of audio objects, and parameter data of at least two related audio objects for one or more frequency bins of a time frame, wherein the number of the at least two related audio objects is lower than the total number of the plurality of audio objects, the method of decoding comprising:
providing the one or more transmission channels in the form of a spectral representation having a plurality of frequency bins in the time frame; and
audio rendering the one or more transmission channels into a plurality of audio channels using the direction information,
wherein the audio rendering comprises: for each of the one or more frequency bins, calculating a contribution from the one or more transmission channels in accordance with first direction information associated with a first one of the at least two related audio objects and in accordance with second direction information associated with a second one of the at least two related audio objects, or operating such that contributions from the one or more transmission channels are taken into account in accordance with the first direction information associated with the first one of the at least two related audio objects and in accordance with the second direction information associated with the second one of the at least two related audio objects.
30. A computer program for performing the method according to example 28 or the method according to example 29 when run on a computer or processor.
31. An encoded audio signal comprising information about parameter data of at least two related audio objects for one or more frequency bins.
32. The encoded audio signal of example 31, further comprising:
the one or more encoded transmission channels,
two or more encoded object identifications of related audio objects for each of one or more of the plurality of frequency bins in the time frame, and one or more encoded combined values or encoded amplitude-related measures as the information about the parameter data, and
quantized and encoded direction data for each audio object in the time frame, the direction data being constant for all of the one or more frequency bins.
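Read together, examples 31 and 32 suggest a per-frame payload along the following lines. The sketch below is only an illustrative grouping of the transmitted quantities; the field names and types are assumptions, not a bitstream syntax.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EncodedFrame:
    transport: bytes                    # one or more encoded transmission channels
    object_ids: List[List[int]]         # per frequency bin: IDs of the two or more related objects
    combined_values: List[List[float]]  # per bin: at least one value, fewer than the number of related objects
    directions: List[Tuple[int, int]]   # per object: quantized (azimuth, elevation), constant over the frame's bins
```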

Claims (32)

1. An apparatus for encoding a plurality of audio objects and associated metadata indicative of directional information about the plurality of audio objects, comprising:
a downmixer (400) for downmixing the plurality of audio objects to obtain one or more transmission channels;
a transmission channel encoder (300) for encoding the one or more transmission channels to obtain one or more encoded transmission channels; and
an output interface (200) for outputting an encoded audio signal comprising said one or more encoded transmission channels,
wherein the downmixer (400) is configured to downmix the plurality of audio objects in response to direction information about the plurality of audio objects.
2. The apparatus of claim 1, wherein the downmixer (400) is configured to:
generating two transmission channels as two virtual microphone signals arranged at the same location and having different orientations, or arranged at two different locations relative to a reference location or orientation, such as a virtual listener location or orientation, or
generating three transmission channels as three virtual microphone signals arranged at the same location and having different orientations, or arranged at three different locations relative to a reference location or orientation, such as a virtual listener location or orientation, or
generating four transmission channels as four virtual microphone signals arranged at the same location and having different orientations, or arranged at four different locations relative to a reference location or orientation, such as a virtual listener location or orientation, or
wherein the virtual microphone signals are virtual first-order microphone signals, or virtual cardioid microphone signals, or virtual figure-of-eight or dipole or bidirectional microphone signals, or virtual subcardioid microphone signals, or virtual unidirectional microphone signals, or virtual supercardioid microphone signals, or virtual omnidirectional microphone signals.
3. The apparatus of claim 1 or 2, wherein the downmixer (400) is configured to:
for each of the plurality of audio objects, deriving (402) weighting information for each transmission channel using direction information of the corresponding audio object;
weighting (404) the corresponding audio object for a particular transmission channel using weighting information for the audio object to obtain an object contribution for the particular transmission channel, and
combining (406) object contributions of the plurality of audio objects to the particular transmission channel to obtain the particular transmission channel.
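For illustration of claims 2 and 3 (and of the time-domain variant of claim 8), the sketch below pans each object into two virtual cardioid microphone signals steered to +90 and -90 degrees. The pick-up pattern, the steering angles, and all names are assumptions chosen for the example, not the mandated downmix rule.

```python
import numpy as np

def direction_driven_downmix(objs, azimuths_deg, mic_azimuths_deg=(90.0, -90.0)):
    # objs: (n_obj, n_samples) time-domain object signals of one frame
    # azimuths_deg: per-object azimuth relative to the virtual listener position
    tx = np.zeros((len(mic_azimuths_deg), objs.shape[1]))
    for c, mic_az in enumerate(mic_azimuths_deg):
        # first-order cardioid pick-up: w = 0.5 * (1 + cos(theta_obj - theta_mic))
        w = 0.5 * (1.0 + np.cos(np.radians(np.asarray(azimuths_deg) - mic_az)))
        tx[c] = w @ objs  # weight each object and combine (sample-by-sample weighted sum)
    return tx
```

Because the weights depend only on the per-frame direction metadata, the downmix rule is static within a frame but may vary from frame to frame, matching claim 7.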
4. The apparatus according to any of the preceding claims,
wherein the downmixer (400) is configured to: calculating the one or more transmission channels as one or more virtual microphone signals, the one or more virtual microphone signals being arranged at the same location and having different orientations, or being arranged at different locations relative to a reference location or orientation, such as a virtual listener location or orientation, the direction information being related to the reference location or orientation,
wherein the different positions or orientations are on or to the left of a centerline and on or to the right of the centerline, or wherein the different positions or orientations are evenly or unevenly distributed over horizontal positions or orientations, e.g., +90 degrees or -90 degrees with respect to the centerline, or -120 degrees, 0 degrees and +120 degrees with respect to the centerline, or wherein the different positions or orientations comprise at least one position or orientation pointing upwards or downwards with respect to a horizontal plane in which a virtual listener is located, wherein the direction information about the plurality of audio objects relates to the virtual listener position or reference position or orientation.
5. The apparatus of one of the preceding claims, further comprising:
a parameter processor (110) for quantizing metadata indicating direction information about the plurality of audio objects to obtain quantized direction terms of the plurality of audio objects,
wherein the downmixer (400) is configured to: operate in response to the quantized direction terms as the direction information, and
wherein the output interface (200) is configured to: introduce information about the quantized direction terms into the encoded audio signal.
6. The apparatus according to any of the preceding claims,
wherein the downmixer (400) is configured to: perform an analysis of the direction information about the plurality of audio objects and place one or more virtual microphones according to the result of the analysis to generate the one or more transmission channels.
7. The apparatus according to any of the preceding claims,
wherein the downmixer (400) is configured to: downmixing (408) using a downmixing rule that is static over a plurality of time frames, or
Wherein the direction information is variable over a plurality of time frames, and wherein the downmixer (400) is configured to downmix (405) using a downmix rule that is variable over the plurality of time frames.
8. The apparatus according to any of the preceding claims,
wherein the downmixer (400) is configured to: downmix the samples of the plurality of audio objects in the time domain using a sample-by-sample weighted-sum combination.
9. The apparatus of one of the preceding claims, further comprising:
an object parameter calculator (100) configured to: calculating parameter data of at least two related audio objects for one or more of a plurality of frequency bins related to a time frame, wherein the number of the at least two related audio objects is lower than the total number of the plurality of audio objects, and
Wherein the output interface (200) is configured to: information about parametric data for the at least two related audio objects for the one or more frequency bins is introduced into the encoded audio signal.
10. The apparatus of claim 9, wherein the object parameter calculator (100) is configured to:
converting (120) each of the plurality of audio objects into a spectral representation having the plurality of frequency bins,
calculating (122) selection information for each audio object for the one or more frequency bins, and
deriving (124), based on the selection information, object identifications indicating the at least two related audio objects as the parameter data, and
wherein the output interface (200) is configured to introduce information about the object identification into the encoded audio signal.
11. The apparatus of claim 9 or 10, wherein the object parameter calculator (100) is configured to: quantizing and encoding (212) one or more amplitude-related measures of the related audio object in the one or more frequency bins or one or more combined values derived from amplitude-related measures, as the parameter data, and
Wherein the output interface (200) is configured to: the quantized one or more amplitude-related measurement values or the quantized one or more combined values are introduced into the encoded audio signal.
12. The apparatus according to claim 10 or 11,
wherein the selection information is an amplitude-related measure of the audio object, such as an amplitude value, a power value or a loudness value, or an amplitude raised to a power different from one, and
wherein the object parameter calculator (100) is configured to: calculating (127) a combined value, e.g., a ratio of an amplitude-related measure of a related audio object to a sum of the amplitude-related measures of two or more related audio objects, and
wherein the output interface (200) is configured to: introducing information about the combined value into the encoded audio signal, wherein the number of information items about the combined value in the encoded audio signal is at least one and less than the number of related audio objects for the one or more frequency bins.
13. The apparatus according to one of claims 10 to 12,
wherein the object parameter calculator (100) is configured to: selecting the object identifications based on an order of the selection information of the plurality of audio objects in the one or more frequency bins.
14. The apparatus according to one of claims 10 to 13, wherein the object parameter calculator (100) is configured to:
calculating (122) signal power as the selection information,
deriving (124), for each frequency bin, object identifications of the two or more audio objects having the largest signal power values in the corresponding one or more frequency bins,
calculating (126), as the parameter data, a power ratio between a sum of the signal powers of the two or more audio objects having the largest signal power values and a signal power of at least one of the audio objects having a derived object identification, and
quantizing and encoding (212) the power ratio, and
wherein the output interface (200) is configured to: introduce the quantized and encoded power ratio into the encoded audio signal.
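A compact sketch of the selection and power-ratio computation of claims 10 to 14, for one frame. The orientation of the ratio (strongest object over the sum of the selected objects) and the toy 3-bit quantizer are assumptions of this sketch.

```python
import numpy as np

def relevant_objects_and_ratio(spec, n_rel=2):
    # spec: (n_obj, n_bins) complex object spectra of one frame
    power = np.abs(spec) ** 2                       # selection information: signal power per bin
    ids = np.argsort(power, axis=0)[-n_rel:][::-1]  # IDs of the n_rel strongest objects per bin
    p_sel = np.take_along_axis(power, ids, axis=0)  # powers of the selected objects
    ratio = p_sel[0] / np.maximum(p_sel.sum(axis=0), 1e-12)
    ratio_q = np.round(ratio * 7.0) / 7.0           # toy 3-bit uniform quantizer
    return ids, ratio_q
```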
15. The apparatus of one of claims 10 to 14, wherein the output interface (200) is configured to: introduce the following into the encoded audio signal:
the one or more encoded transmission channels,
two or more encoded object identifications of related audio objects for each of one or more of the plurality of frequency bins in the time frame, and one or more encoded combined values or encoded amplitude-related measures as the parameter data, and
quantized and encoded direction data for each audio object in the time frame, the direction data being constant for all of the one or more frequency bins.
16. The apparatus of one of claims 9 to 15, wherein the object parameter calculator (100) is configured to: calculating parameter data of at least the most significant object and the second most significant object in the one or more frequency bins, or
Wherein the number of audio objects in the plurality of audio objects is three or more, the plurality of audio objects including a first audio object, a second audio object, and a third audio object, and
wherein the object parameter calculator (100) is configured to: for a first frequency bin of the one or more frequency bins, computing only a first set of audio objects, such as the first audio object and the second audio object, as the related audio objects; and for a second frequency bin of the one or more frequency bins, computing only a second set of audio objects, such as the second audio object and the third audio object or the first audio object and the third audio object, as the related audio objects, wherein the first set of audio objects differs from the second set of audio objects in at least one member.
17. The apparatus of one of claims 9 to 16, wherein the object parameter calculator (100) is configured to:
calculating raw parametric data having a first time or frequency resolution and combining the raw parametric data into combined parametric data having a second time or frequency resolution lower than the first time or frequency resolution, and calculating parametric data of at least two related audio objects with respect to the combined parametric data having the second time or frequency resolution, or
determining parameter bands having a second time or frequency resolution different from the first time or frequency resolution used in the time or frequency decomposition of the plurality of audio objects, and calculating the parameter data of the at least two related audio objects for the parameter bands having the second time or frequency resolution.
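The reduction to a coarser parameter resolution in claim 17 can be pictured as summing the per-bin selection information into parameter bands before the related-object selection runs. The band edges below are an assumed, roughly perceptually motivated grouping.

```python
import numpy as np

def group_into_parameter_bands(power, band_edges=(0, 2, 4, 8, 16, 32, 64)):
    # power: (n_obj, n_bins) per-bin powers at the filter-bank resolution
    # returns: (n_obj, n_bands) powers at the coarser parameter-band resolution
    return np.stack([power[:, b0:b1].sum(axis=1)
                     for b0, b1 in zip(band_edges[:-1], band_edges[1:])], axis=1)
```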
18. A decoder for decoding an encoded audio signal, the encoded audio signal comprising: one or more transmission channels and direction information of a plurality of audio objects, and parameter data of the audio objects for one or more frequency bins of a time frame, the decoder comprising:
an input interface (600) for providing said one or more transmission channels in the form of a spectral representation, said spectral representation having a plurality of frequency bins in said time frame; and
an audio renderer (700) for rendering the one or more transmission channels into a plurality of audio channels using the direction information,
wherein the audio renderer (700) is configured to: calculate direct response information (704) from the one or more audio objects for each of the plurality of frequency bins and from the direction information (810) associated with the one or more related audio objects in the frequency bin.
19. The decoder according to claim 18,
wherein the audio renderer (700) is configured to: calculating (706) covariance synthesis information using the direct response information and information (702) about the plurality of audio channels, and applying (727) the covariance synthesis information to the one or more transmission channels to obtain the plurality of audio channels, or
wherein the direct response information (704) is a direct response vector for each of the one or more audio objects, and wherein the covariance synthesis information is a covariance synthesis matrix, and wherein the audio renderer (700) is configured to: perform a matrix operation for each frequency bin when applying (727) the covariance synthesis information.
20. The decoder according to claim 18 or 19, wherein the audio renderer (700) is configured to:
in calculating the direct response information (704), deriving a direct response vector for the one or more audio objects, and for the one or more audio objects, calculating a covariance matrix from each direct response vector,
in calculating the covariance synthesis information, target covariance information is derived (724) from:
a covariance matrix of one audio object or covariance matrices of a plurality of audio objects,
power information about a corresponding one or more audio objects, and
power information derived from the one or more transmission channels.
21. The decoder of claim 20, wherein the audio renderer (700) is configured to:
in calculating the direct response information, deriving direct response vectors for the one or more audio objects and, for each of the one or more audio objects, calculating (723) a covariance matrix from each direct response vector,
deriving (726) input covariance information from the one or more transmission channels, and
deriving (725a, 725b) mixing information from the target covariance information, the input covariance information, and information about the plurality of channels, and
applying (727) the mixing information to the transmission channels of each frequency bin in the time frame.
22. The decoder according to claim 21, wherein the result of applying the mixing information for each frequency bin in the time frame is converted (708) into the time domain to obtain a plurality of audio channels in the time domain.
23. The decoder according to one of claims 18 to 22, wherein the audio renderer (700) is configured to:
in decomposing (752) an input covariance matrix derived from the transmission channels, using only the main diagonal elements of the input covariance matrix, or
performing a decomposition (751) of a target covariance matrix using a direct response matrix and a power matrix of the objects or transmission channels, or
performing (752) a decomposition of the input covariance matrix by taking the square root of each main diagonal element of the input covariance matrix, or
calculating (753) a regularized inverse of the decomposed input covariance matrix, or
performing (756) a singular value decomposition, in computing the optimal matrix to be used for energy compensation, without extending the identity matrix.
24. The decoder according to one of claims 18 to 23, wherein the parameter data of the one or more audio objects comprises parameter data of at least two related audio objects, wherein the number of the at least two related audio objects is lower than the total number of the plurality of audio objects, and
wherein the audio renderer (700) is configured to: calculate, for each of the one or more frequency bins, a contribution from the one or more transmission channels in accordance with first direction information associated with a first one of the at least two related audio objects and in accordance with second direction information associated with a second one of the at least two related audio objects.
25. The decoder according to claim 24,
wherein the audio renderer (700) is configured to: ignore, for the one or more frequency bins, direction information of audio objects other than the at least two related audio objects.
26. The decoder according to claim 24 or 25,
wherein the encoded audio signal comprises, in the parameter data, an amplitude-related measure for each related audio object or a combined value relating to the at least two related audio objects, and
wherein the audio renderer (700) is configured to: operate such that contributions from the one or more transmission channels are taken into account in accordance with first direction information associated with a first one of the at least two related audio objects and in accordance with second direction information associated with a second one of the at least two related audio objects, or determine a quantitative contribution of the one or more transmission channels from the amplitude-related measure or the combined value.
27. The decoder of claim 26, wherein the encoded audio signal comprises a combined value in the parameter data, and
wherein the audio renderer (700) is configured to: determine a contribution of the one or more transmission channels using the combined value for one of the related audio objects and the direction information of this related audio object, and
wherein the audio renderer (700) is configured to: determine the contribution of the one or more transmission channels using a value derived from the combined value for another of the related audio objects in the one or more frequency bins and the direction information of this other related audio object.
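One way to read claims 26 and 27 for a bin with two related objects: the transmitted combined value scales the first object's direct response, and a value derived from it (here its complement) scales the second. The square-root gain mapping is an assumption of this sketch.

```python
import numpy as np

def combined_direct_response(dir_vec_1, dir_vec_2, combined_value):
    # dir_vec_1, dir_vec_2: (n_out,) panning vectors from the two objects' direction information
    # combined_value: transmitted ratio in [0, 1] associated with the first related object
    g1 = np.sqrt(combined_value)        # gain for the first related object
    g2 = np.sqrt(1.0 - combined_value)  # value derived from the combined value (claim 27)
    return g1 * dir_vec_1 + g2 * dir_vec_2
```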
28. The decoder according to one of claims 24 to 27, wherein the audio renderer (700) is configured to:
the direct response information is calculated from the associated audio object for each of the plurality of frequency bins and the direction information associated with the associated audio object in the frequency bin (704).
29. The decoder according to claim 28,
wherein the audio renderer (700) is configured to: determine (741) a diffuse signal for each of the plurality of frequency bins using diffuseness information included in the metadata, such as a diffuseness parameter, or using a decorrelation rule, and combine a direct response determined from the direct response information with the diffuse signal to obtain a spectral-domain rendered signal for a channel of the plurality of channels.
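Claim 29's combination of a direct response with a diffuse part can be sketched per bin as an energy split controlled by the diffuseness. The mono prototype downmix and the decorrelate callable are assumptions of the sketch.

```python
import numpy as np

def render_direct_plus_diffuse(x, direct_response, diffuseness, decorrelate):
    # x: (n_in,) transmission-channel values of one bin; direct_response: (n_out,) gains
    # diffuseness in [0, 1]; decorrelate: callable returning (n_out,) mutually decorrelated signals
    proto = x.mean()  # simple mono prototype signal
    direct = np.sqrt(1.0 - diffuseness) * direct_response * proto
    diffuse = np.sqrt(diffuseness) * decorrelate(proto)
    return direct + diffuse  # spectral-domain rendered channels for this bin
```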
30. A method of encoding a plurality of audio objects and associated metadata indicative of directional information about the plurality of audio objects, comprising:
downmixing the plurality of audio objects to obtain one or more transmission channels;
encoding the one or more transmission channels to obtain one or more encoded transmission channels; and
outputting an encoded audio signal comprising the one or more encoded transmission channels,
wherein the downmixing comprises downmixing the plurality of audio objects in response to direction information about the plurality of audio objects.
31. A method of decoding an encoded audio signal, the encoded audio signal comprising: one or more transmission channels and direction information of a plurality of audio objects, and parameter data of the audio objects for one or more frequency bins of a time frame, the method comprising:
providing the one or more transmission channels in the form of a spectral representation having a plurality of frequency bins in the time frame; and
audio rendering the one or more transmission channels into a plurality of audio channels using the direction information,
wherein the audio rendering comprises: calculating direct response information from one or more audio objects for each of the plurality of frequency bins and from direction information associated with the one or more related audio objects in the frequency bin.
32. A computer program for performing the method of claim 30 or the method of claim 31 when run on a computer or processor.
CN202180077244.8A 2020-10-13 2021-10-12 Apparatus and method for encoding multiple audio objects using direction information during downmixing or decoding using optimized covariance synthesis Pending CN116648931A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
EP20201633.3 2020-10-13
EP20215648.5 2020-12-18
EP21184366 2021-07-07
EP21184366.9 2021-07-07
PCT/EP2021/078209 WO2022079044A1 (en) 2020-10-13 2021-10-12 Apparatus and method for encoding a plurality of audio objects using direction information during a downmixing or apparatus and method for decoding using an optimized covariance synthesis

Publications (1)

Publication Number Publication Date
CN116648931A (en) 2023-08-25

Family

ID=76845012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180077244.8A Pending CN116648931A (en) 2020-10-13 2021-10-12 Apparatus and method for encoding multiple audio objects using direction information during downmixing or decoding using optimized covariance synthesis

Country Status (1)

Country Link
CN (1) CN116648931A (en)

Similar Documents

Publication Publication Date Title
EP2535892B1 (en) Audio signal decoder, method for decoding an audio signal and computer program using cascaded audio object processing stages
JP5133401B2 (en) Output signal synthesis apparatus and synthesis method
JP4589962B2 (en) Apparatus and method for generating level parameters and apparatus and method for generating a multi-channel display
RU2406165C2 (en) Methods and devices for coding and decoding object-based audio signals
US10448185B2 (en) Multi-channel decorrelator, multi-channel audio decoder, multi-channel audio encoder, methods and computer program using a premix of decorrelator input signals
TWI825492B (en) Apparatus and method for encoding a plurality of audio objects, apparatus and method for decoding using two or more relevant audio objects, computer program and data structure product
CN112074902B (en) Audio scene encoder, audio scene decoder and related methods using hybrid encoder/decoder spatial analysis
GB2485979A (en) Spatial audio coding
US20230238007A1 (en) Apparatus and method for encoding a plurality of audio objects using direction information during a downmixing or apparatus and method for decoding using an optimized covariance synthesis
CN116648931A (en) Apparatus and method for encoding multiple audio objects using direction information during downmixing or decoding using optimized covariance synthesis
CN116529815A (en) Apparatus and method for encoding a plurality of audio objects and apparatus and method for decoding using two or more related audio objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination