EP2690621A1 - Method and Apparatus for downmixing MPEG SAOC-like encoded audio signals at receiver side in a manner different from the manner of downmixing at encoder side


Info

Publication number
EP2690621A1
Authority
EP
European Patent Office
Prior art keywords
data
matrix
signals
downmixing
audio
Prior art date
Legal status
Withdrawn
Application number
EP12305914.9A
Other languages
German (de)
French (fr)
Inventor
Oliver Wuebbolt
Adrian Murtaza
Current Assignee
Thomson Licensing SAS
Original Assignee
Thomson Licensing SAS
Application filed by Thomson Licensing SAS filed Critical Thomson Licensing SAS
Priority to EP12305914.9A
Publication of EP2690621A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204 - Speech or audio signals analysis-synthesis techniques using spectral analysis, using subband decomposition


Abstract

Content providers can add metadata to audio content such that consumers can control down-mix or dynamic range of selected parts of the audio signal. MPEG Spatial Audio Object Coding (SAOC) deals with parametric coding techniques for complex audio scenes at bit rates normally used for mono or stereo sound coding, offering at decoder side an interactive rendering of the audio objects mixed into the audio scene, whereby only a small amount of extra information is added to the audio bit stream. In parametric coding the biggest issue is that a perfect separation between objects is usually not possible. This issue is treated in the MPEG SAOC standard by using residual coding techniques and by ensuring better separation only for a small set of objects before encoding. The invention describes how, by adding only a small amount of extra information to the SAOC parameters, a remixing of a broadcast audio signal is achieved at receiver side, using information about the actual mix of the audio signal, audio signal characteristics like correlation, and the desired audio scene rendering.

Description

  • The invention relates to a method and to an apparatus for downmixing MPEG SAOC-like encoded audio signals at receiver side in a manner different from the manner of downmixing at encoder side, wherein the decoder side downmixing is controlled by desired playback configuration data and/or desired object positioning data.
  • Background
  • Audio content providers face consumers in increasingly heterogeneous listening situations, e.g. home theatres, mobile audio, car and in-flight entertainment. Audio content cannot be processed by its creator or broadcaster so as to match every possible consumer listening condition, for example audio/video content played back on a mobile phone. Besides different listening conditions, different listening experiences can also be desirable: for instance, in a live soccer broadcast a consumer can control his own virtual position within the sound scene of the stadium (pitch or stands), or can control the virtual position and the predominance of the commentator.
  • Content providers can add guiding metadata to the audio content, such that consumers can control the down-mix or the dynamic range of selected parts of the audio signal and/or assure high speech intelligibility. For incorporation of such metadata into existing broadcasting chains, it is important that the general audio format is not changed (legacy playback) and that only a small amount of extra information (e.g. as ancillary data) is added to the audio bit stream. MPEG Spatial Audio Object Coding (SAOC), specified in ISO/IEC 23003-1:2007, MPEG audio technologies - Part 1: MPEG Surround, and ISO/IEC 23003-2:2010, MPEG audio technologies - Part 2: Spatial Audio Object Coding, deals with parametric coding techniques for complex audio scenes at bit rates normally used for mono or stereo sound coding, offering at decoder side an interactive rendering of the audio objects mixed into the audio scene.
  • MPEG SAOC was developed starting from Spatial Audio Coding, which is based on a 'channel-oriented' approach, by introducing the concept of audio objects, with the purpose of offering even more flexibility at receiver side. Since it is a parametric multiple-object coding technique, the additional cost in terms of bit rate is limited to 2-3 kbit/s for each audio object. Although the bit rate increases with the number of audio objects, it still remains small in comparison with the actual audio data transmitted as a mono/stereo downmix.
  • In parametric coding the biggest issue is that a perfect separation between objects is usually not possible, especially in an extreme mixing scenario, for example when all sound objects are mixed in all channels, or in case of 'applause-like' sound. This issue is treated in the MPEG SAOC standard by using residual coding techniques and ensuring better separation only for a small set of objects before encoding.
  • A standard MPEG SAOC architecture is illustrated in Fig. 1. A variable number N of sound objects Obj.#1, Obj.#2, Obj.#3, Obj.#4 is input to an SAOC encoder 11 that provides an encoded background compatible mono or stereo downmix signal or signals with SAOC parameters and side information. The downmix data signal or signals and the side information data are sent as dedicated bit streams to an SAOC decoder 12. The receiver side processing is carried out basically in two steps: first, using the side information, SAOC decoder 12 reconstructs the original sound objects Obj.#1, Obj.#2, Obj.#3, Obj.#4. Controlled by interaction/control information, these reconstructed sound objects are re-mixed and rendered in an MPEG surround renderer 13. For example, the reconstructed sound objects are output as a stereo signal with channels #1 and #2.
  • In practice, in order to avoid transforming twice between the time domain and the frequency domain, the steps are merged, thereby substantially reducing the computational complexity. The processing in SAOC encoder 11, SAOC decoder 12 and renderer 13 is performed in the frequency domain, using a nonuniform scale in order to model the human auditory system most efficiently. To ensure compatibility with MPEG Surround, a hybrid Quadrature Mirror Filter hQMF 21 shown in Fig. 2 was chosen, which offers better resolution at low frequencies. After applying the hQMF filter to all input signals, a time/frequency grid is obtained that maps every time slot or frame of the input signals to a number of processing bands obtained by merging the frequency bands.
  • The following SAOC parameters are computed for every time/frequency tile and are transmitted to the decoder side:
    • Object Level Differences data OLD describe the amount of energy contained by each sound object with respect to the sound object having the highest energy. The energy level value of the loudest object is described by the Object Energy parameter data NRG, which can be transmitted to the decoder;
    • Inter-Object Coherence data IOC are used to describe the amount of similarity between the sound objects and are computed for every pair of two input audio objects;
    • Downmix Gains data DMG and Downmix Channel Level Differences data DCLD are used for describing the gains applied to each sound object in the mixing process, and at decoder side they are used for reconstructing the downmixing matrix.
  • At decoder side, in order to avoid generation of bad quality audio content due to extreme rendering, a Distortion Control Unit DCU can be used. The final rendering matrix coefficients are computed as a linear combination of user-specified coefficients and the target coefficients which are assumed to be distortion-free.
  • Invention
  • The main drawback of the solution offered by MPEG SAOC is the limitation to a maximum of two down-mix channels. Further, the MPEG SAOC standard is not designed for 'Solo/Karaoke' type of applications, which involve the complete suppression of one or more audio objects. In the MPEG SAOC standard this problem is tackled by using residual coding for specific audio objects, thereby increasing the bit rate.
  • A problem to be solved by the invention is to overcome these limitations of MPEG SAOC and to allow for adding of side information for a legacy multi-channel audio broadcasting like 5.1. This problem is solved by the methods disclosed in claims 1 and 2. Apparatuses that utilise these methods are disclosed in claims 3 and 4, respectively.
  • The invention describes how, by adding only a small amount of extra bit rate, at decoder or receiver side a re-mixing of a broadcast audio signal is achieved using information about the actual mix of the audio signal, audio signal characteristics like correlation, level differences, and the desired audio scene rendering.
  • A second embodiment shows how to determine already at encoder side the suitability of the actual multi-channel audio signal for a remix at decoder side. This feature allows for countermeasures (e.g. changing the mixing matrix used, i.e. how the sound objects are mixed into the different channels) if a decoder side re-mixing is not possible without perceivable artefacts, or without problems like the additional transmission of the audio objects themselves for a short time.
  • Advantageously, because the same type of side information parameters is processed, the invention could be used to amend the MPEG SAOC standard correspondingly, based on the same building blocks.
  • In principle, the inventive encoding method is suited for downmixing spatial audio signals that can be downmixed at receiver side in a manner different from the manner of downmixing at encoder side, wherein said encoding is based on MPEG SAOC and said downmixing at receiver side can be controlled by desired playback configuration data and/or desired object positioning data, said method including the steps:
    • processing M correlated sound signals, M being greater than '2', and L independent sound signals, L being '1' or greater, in an analysis filter bank providing corresponding time/frequency domain signals;
    • multiplying said time/frequency domain signals with a downmix matrix Dl,m, followed by processing the resulting signals in a synthesis filter bank that has an inverse operation of said analysis filter bank and that provides M time domain output signals;
    • determining from said time/frequency domain signals MPEG SAOC side information data including Object Level Differences data OLD and Inter-Object Coherence data IOC, as well as enhanced Downmix Gains data DMG and Downmix Channel Level Differences data DCLD, wherein said DMG and DCLD data are related to M channels.
  • In principle, the inventive decoding method is suited for downmixing spatial audio signals processed according to the encoding method in a manner different from the manner of downmixing at encoder side, wherein said downmixing at receiver side can be controlled by desired playback configuration data and/or desired object positioning data, said method including the steps:
    • receiving said processed spatial audio signals and processing them in an analysis filter bank, providing corresponding time/frequency domain signals;
    • determining from said desired playback configuration data and/or said desired object positioning data a rendering matrix Al,m ;
    • determining from the received OLD, IOC, DMG and DCLD data an estimated covariance matrix Cl,m and a reconstructed down-mixing matrix Dl,m ;
    • calculating an estimation matrix $T_{l,m} = A_{l,m} C_{l,m} D_{l,m}^{H} \left( D_{l,m} C_{l,m} D_{l,m}^{H} \right)^{-1}$;
    • multiplying said time/frequency domain signals (Y) with said estimation matrix Tl,m so as to get desired-remix signals, followed by processing said desired-remix signals in a synthesis filter bank that has an inverse operation of said analysis filter bank.
  • In principle, the inventive encoding apparatus is suited for downmixing spatial audio signals that can be downmixed at receiver side in a manner different from the manner of downmixing at encoder side, wherein said encoding is based on MPEG SAOC and said downmixing at receiver side can be controlled by desired playback configuration data and/or desired object positioning data, said apparatus including:
    • an analysis filter bank for processing M correlated sound signals, M being greater than '2', and L independent sound signals, L being '1' or greater, providing corresponding time/frequency domain signals;
    • means being adapted for multiplying said time/frequency domain signals with a downmix matrix Dl,m ;
    • a synthesis filter bank for said multiplied time/frequency domain signals that has an inverse operation of said analysis filter bank and that provides M time domain output signals;
    • means being adapted for determining from said time/frequency domain signals MPEG SAOC side information data including Object Level Differences data OLD and Inter-Object Coherence data IOC, as well as enhanced Downmix Gains data DMG and Downmix Channel Level Differences data DCLD, wherein said DMG and DCLD data are related to M channels.
  • In principle, the inventive decoding apparatus is suited for downmixing spatial audio signals processed according to the encoding method in a manner different from the manner of downmixing at encoder side, wherein said downmixing at receiver side can be controlled by desired playback configuration data and/or desired object positioning data, said apparatus including:
    • means being adapted for receiving said processed spatial audio signals and for processing them in an analysis filter bank, providing corresponding time/frequency domain signals;
    • means being adapted for determining from said desired playback configuration data and/or said desired object positioning data a rendering matrix Al,m ;
    • means being adapted for determining from the received OLD, IOC, DMG and DCLD data an estimated covariance matrix Cl,m and a reconstructed down-mixing matrix Dl,m;
    • means (36) being adapted for calculating an estimation matrix $T_{l,m} = A_{l,m} C_{l,m} D_{l,m}^{H} \left( D_{l,m} C_{l,m} D_{l,m}^{H} \right)^{-1}$;
    • means being adapted for multiplying said time/frequency domain signals with said estimation matrix Tl,m so as to get desired-remix signals, followed by processing said desired-remix signals in a synthesis filter bank that has an inverse operation of said analysis filter bank.
  • Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.
  • Drawings
  • Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:
  • Fig. 1: standard MPEG SAOC system;
  • Fig. 2: enhanced MPEG SAOC encoder;
  • Fig. 3: enhanced MPEG SAOC decoder.
  • Exemplary embodiments
  • The inventive spatial audio object coding system with five down-mix channels facilitates a backward compatible transmission at bit rates only slightly higher (due to the extended content of the side information: OLD, IOC, DMG and DCLD) than the bit rates for known 5.1 channel transmission. By using the side information, a guided spatial remix is achieved, as well as the adaptation of the audio mix to different listening situations, including guided downmix (or even upmix) of multi-channel audio.
  • In the following embodiment, a number of M=5 channels containing the ambience signals and a number of L audio objects mixed over the ambience are considered. An example is the stadium ambience of a soccer match plus specific sound effects (ball kicks, whistle) and one or more commentators. Thus, the encoder input receives N = M+L audio channels.
  • In other embodiments, M is at least '2' and L is '1' or greater.
  • At decoder side it is not intended to reconstruct the audio objects, but to offer the possibility of re-mixing, attenuating, totally suppressing, and changing the position of the audio objects in the rendered audio scene.
  • For the processing part of the system any time/frequency transform can be used. In this embodiment, hybrid Quadrature Mirror Filter (hQMF) banks are used for better selectivity in the frequency domain. The spatial audio input signals are processed in non-overlapping multiple-sample temporal slots, in particular 64-sample temporal slots. These temporal slots are used for computing the perceptual cues or characteristics for every successive frame, which has a length of a fixed number of temporal slots, in particular 16 temporal slots.
  • In the frequency domain, 71 frequency bands are used according to the sensitivity of the human auditory system, and are grouped into K processing bands, K having a value of '2', '3' or '4', thereby obtaining different levels of accuracy. The hQMF filter bank transforms in each case 64 time samples into 71 frequency samples. The processing band borders are represented in the following table; a grouping sketch follows the table.

    Table 2.1: Processing bands according to different quality levels

                          K=4      K=3      K=2
    Processing Band 1     0-7      0-7      0-20
    Processing Band 2     8-20     8-55     21-70
    Processing Band 3     21-29    56-70    -
    Processing Band 4     30-70    -        -
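  • To make the grouping concrete, the following Python sketch maps a hybrid band index (0..70) to its processing band for a given K. The band borders are taken from Table 2.1; the names PROCESSING_BANDS and band_index are illustrative assumptions, not part of any standard.

```python
# Illustrative sketch of the Table 2.1 grouping; names are assumptions.

# Inclusive hybrid-band border ranges (0..70) for each quality level K.
PROCESSING_BANDS = {
    4: [(0, 7), (8, 20), (21, 29), (30, 70)],
    3: [(0, 7), (8, 55), (56, 70)],
    2: [(0, 20), (21, 70)],
}

def band_index(k: int, hybrid_band: int) -> int:
    """Return the 0-based processing band of a hybrid QMF band for quality level k."""
    for m, (lo, hi) in enumerate(PROCESSING_BANDS[k]):
        if lo <= hybrid_band <= hi:
            return m
    raise ValueError("hybrid band outside 0..70")
```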
  • A basic block diagram of the inventive encoder and decoder is shown in Fig.2 and Fig.3, respectively.
  • In the encoder in Fig.2, a hybrid Quadrature Mirror Filter analysis filter bank step or stage 21 is applied to all audio input signals, e.g. ambience channels Ch.#1 to Ch.#5 and sound objects Obj.#1 to Obj.#L (at least one sound object). The number of ambience channels is not limited to five. In this invention, contrary to MPEG SAOC, the ambience channels are not independent sound objects but are usually correlated sound signals. The filter bank outputs time/frequency domain signals X. These are fed on one hand to a down-mixer step or stage 22, which multiplies them with a downmix matrix D and provides, via a hybrid Quadrature Mirror Filter synthesis filter bank step or stage 24 that performs the inverse operation of the analysis filter bank, the time domain audio channels for transmission. On the other hand, the signals X are fed to an enhanced MPEG SAOC parameter calculator step or stage 23. The enhanced SAOC parameters determine the rendering flexibility in a decoder and include, as mentioned above, Object Level Differences data OLD, Inter-Object Coherence data IOC, Downmix Gains data DMG and Downmix Channel Level Differences data DCLD, and can comprise Object Energy parameter data NRG. Except for the DMG and DCLD data, these parameters correspond to original MPEG SAOC parameters. From these side information data items a rank can be determined, as described below, in an optional rank calculator step or stage 25, and the side information data items together with data items regarding any re-mix constraints (described below) are transmitted.
  • The output signals of step/stage 24 together with the output signals of step/stage 25 are used to form an enhanced MPEG SAOC bitstream.
  • As mentioned in section 5.1 "SAOC overview/Introduction" of the MPEG SAOC standard, the object parameters are quantised and coded efficiently, and are correspondingly decoded and inversely quantised at receiver side. Before transmission, the downmix signal can be compressed, and is correspondingly decompressed at receiver side. The SAOC side information is (or can be) encoded according to the MPEG SAOC standard and is transmitted together with the DMG and DCLD data, e.g. as the ancillary data portion of the downmix bitstream.
  • In the receiver-side decoder in Fig.3, a hybrid Quadrature Mirror Filter analysis filter bank step or stage 31 corresponding to filter bank 21 receives and processes the transmitted hQMF synthesised data from the enhanced MPEG SAOC bitstream, and feeds them as time/frequency domain data to an enhanced MPEG SAOC decoder or transcoder step or stage 32 that is controlled by an estimation matrix T. A rendering matrix step or stage 35 receives user data regarding perceptual cues, e.g. a desired playback configuration and desired object positions, as well as the transmitted re-mix constraint data items, and therefrom a corresponding rendering matrix A is determined. Matrix A has the size of down-mixing matrix D and its coefficients are based on the coefficients of matrix D. If, for example, only the 'last' sound object Obj.#L shall not be present in the decoder output signals, the last row of rendering matrix A contains zero values only and all other matrix coefficients are identical to the coefficients in matrix D; i.e., each sound object is represented by a different row in matrix A.
  • Matrix A, together with an estimated covariance matrix C and a reconstructed down-mixing matrix D, are used for determining (as explained below) in an estimation matrix generator step or stage 36 the estimation matrix T.
  • The estimated covariance matrix C and the reconstructed down-mixing matrix D are determined from the received side information data in a covariance and down-mixing matrix calculation step or stage 34. The estimation matrix T is used for decoding or transcoding the audio signals of the new audio scene. The downmixed signals of step/stage 32 are output as channel signals (e.g. Ch.#1 to Ch.#5) via a hybrid Quadrature Mirror Filter synthesis filter bank step or stage 33 (corresponding to synthesis filter bank 24).
  • In order to obtain a minimum bit rate for the side information encoding, the perceptual cues are transmitted as ancillary data in the main bitstream. "Minimum bit rate" means a rate such that the resulting audio quality is not affected, i.e. the distortions caused by the slightly lower bit rate available for the audio signals are not audible, or at least not annoying. For characterising the input audio objects, the Object Level Differences data OLD and Inter-Object Coherence data IOC are used, the values of which are computed in step/stage 23, e.g. according to Annex D.2 "Calculation of SAOC parameters" of the MPEG SAOC standard, for every frame/frequency processing band tile (l,m), i.e. for every frame of 16 non-overlapping temporal slots and each of the K processing bands.
  • First the auto-correlation and the cross-correlation of any two objects are computed and saved in a matrix format:

    $$\mathrm{nrg}_{l,m}(i,j) = \frac{\sum_{t \in l} \sum_{k \in m} X_{t,k}(i)\, X_{t,k}^{*}(j)}{\sum_{t \in l} \sum_{k \in m} 1 + \varepsilon}\,, \qquad (1)$$

    where the indices i and j stand for the ambience channel number and the audio object number, respectively, m is a current frequency processing band, k is a running frequency sample index within frequency processing band m, l is a current frame, t is a running temporal slot index within frame l, and ε has a small value (e.g. 10⁻⁹) and avoids a division by zero in the following computations.
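  • A minimal numpy sketch of equation (1) is given below. It assumes that the tile of one frame l and processing band m is available as a complex array X_tile of shape (N, T, K), with N input signals, T temporal slots and K frequency samples; this data layout is an assumption for illustration, not a normative interface.

```python
import numpy as np

def nrg_matrix(X_tile: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Equation (1): nrg(i,j) = sum_{t,k} X_{t,k}(i) X*_{t,k}(j) / (sum_{t,k} 1 + eps)."""
    n_sig, n_slots, n_freq = X_tile.shape
    Xf = X_tile.reshape(n_sig, n_slots * n_freq)   # flatten the (t, k) tile
    return (Xf @ Xf.conj().T) / (n_slots * n_freq + eps)
```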
  • The desired perceptual cues OLD and IOC are computed as:

    $$\mathrm{OLD}_{l,m}(i) = \frac{\mathrm{nrg}_{l,m}(i,i)}{\mathrm{NRG}_{l,m}}\,, \qquad \mathrm{IOC}_{l,m}(i,j) = \frac{\mathrm{nrg}_{l,m}(i,j)}{\sqrt{\mathrm{nrg}_{l,m}(i,i)\, \mathrm{nrg}_{l,m}(j,j)}}\,, \qquad (2)$$

    where $\mathrm{NRG}_{l,m} = \max_i\left(\mathrm{nrg}_{l,m}(i,i)\right)$ represents the absolute object energy of the object with the highest energy.
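  • Continuing the sketch, OLD, IOC and NRG per equation (2) can be derived from the nrg matrix. The small guards added against silent signals are assumptions to keep the sketch robust; they are not part of the equations above.

```python
import numpy as np

def old_ioc(nrg: np.ndarray):
    """Equation (2): OLD from the diagonal energies, IOC from the normalised cross-terms."""
    diag = np.real(np.diag(nrg))                         # nrg(i,i): per-signal energies
    NRG = max(diag.max(), 1e-12)                         # loudest object's energy (guarded)
    OLD = diag / NRG
    IOC = nrg / (np.sqrt(np.outer(diag, diag)) + 1e-12)  # guard against silent signals
    return OLD, IOC, NRG
```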
  • In order to characterise at decoder side the down-mixing matrix coefficients, in step/stage 23 the values for Downmix Gains data DMG and Downmix Channel Level Differences data DCLD are calculated, again for every frame l with 16 temporal slots and every frequency processing band m:

    $$\mathrm{DMG}_{l,m}(r) = 10 \log_{10}\left( \sum_{i=1}^{5} D_{l,m}^{2}(i,r) + \varepsilon \right)\,, \qquad \mathrm{DCLD}_{l,m}(r,j) = 10 \log_{10}\left( \frac{D_{l,m}^{2}(1,r) + \varepsilon}{D_{l,m}^{2}(j,r) + \varepsilon} \right)\,, \qquad (3)$$

    where r=1:N represents the input signal index and j=1:5 (in other embodiments j=1:M) represents the down-mix channel index. The DMG and DCLD data or parameters are extended versions of the corresponding MPEG SAOC parameters because they are not limited to two channels as in MPEG SAOC. Depending on the mixing procedure, the time/frequency resolution of DMG and DCLD can be adapted to the moving speed of the audio objects in the audio scene. Such a time/frequency resolution change does not affect the performance of the inventive processing, and therefore it is assumed for simplicity that the time/frequency resolution at which these parameters are computed is equal to the time/frequency resolution at which the processing is done.
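  • Equation (3) can be sketched as follows for a real-valued 5xN downmix matrix D of the current tile. The function name and the (channels, inputs) orientation of the DCLD array are illustrative assumptions.

```python
import numpy as np

def dmg_dcld(D: np.ndarray, eps: float = 1e-9):
    """Equation (3): DMG per input signal, DCLD of each channel relative to channel 1."""
    P = D ** 2                                        # per-channel signal powers, shape (5, N)
    DMG = 10.0 * np.log10(P.sum(axis=0) + eps)        # shape (N,)
    DCLD = 10.0 * np.log10((P[0] + eps) / (P + eps))  # shape (5, N); row 0 is 0 dB
    return DMG, DCLD
```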
  • At decoder side the perceptual cues are used in step/stage 34 for approximating the covariance matrix C of the original input channels. When considering the definition of the Object Level Differences OLD and Inter-Object Coherence IOC, the following form for the estimated covariance matrix C is obtained, computed for every time/frequency (l,m) tile:

    $$C_{l,m}(i,j) = \sqrt{\mathrm{OLD}_{l,m}(i)\, \mathrm{OLD}_{l,m}(j)}\; \mathrm{IOC}_{l,m}(i,j)\,. \qquad (4)$$
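  • Equation (4) is a direct elementwise expression; a one-line numpy sketch (the function name is an illustrative choice):

```python
import numpy as np

def covariance_from_cues(OLD: np.ndarray, IOC: np.ndarray) -> np.ndarray:
    """Equation (4): C(i,j) = sqrt(OLD(i) * OLD(j)) * IOC(i,j)."""
    return np.sqrt(np.outer(OLD, OLD)) * IOC
```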
  • In order to remix the audio objects, besides the new rendering matrix also the original rendering or mixing matrix is required at decoder side. For re-constructing the downmixing matrix D in step/stage 34, the Downmix Gains DMG values and the Downmix Channel Level Differences DCLD values from the additional side information are used. The content of the original down-mix matrix D is computed in step/stage 34 as:

    $$D_{l,m}(1,r) = 10^{\frac{\mathrm{DMG}_{l,m}(r)}{20}} \sqrt{\frac{1}{1 + \sum_{j=2}^{5} 10^{-0.1\, \mathrm{DCLD}_{l,m}(r,j)}}}\,, \qquad D_{l,m}(i,r) = 10^{\frac{\mathrm{DMG}_{l,m}(r)}{20}} \sqrt{\frac{10^{-0.1\, \mathrm{DCLD}_{l,m}(r,i)}}{1 + \sum_{j=2}^{5} 10^{-0.1\, \mathrm{DCLD}_{l,m}(r,j)}}}\,, \qquad (5)$$

    where r=1:N represents the input signal index, i=2:5 (in other embodiments i=2:M) represents the down-mix channel index, and j represents a running down-mix channel index within the sums. Matrix D is computed differently than according to MPEG SAOC, but its resulting content can be assumed to be identical.
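  • Equation (5) can be sketched as the inverse of the dmg_dcld computation above; DCLD is assumed in the same (channels, inputs) orientation, an implementation choice rather than part of the claims. Up to the ε guards, reconstruct_D(*dmg_dcld(D)) recovers the coefficient magnitudes of a non-negative downmix matrix D.

```python
import numpy as np

def reconstruct_D(DMG: np.ndarray, DCLD: np.ndarray) -> np.ndarray:
    """Equation (5): rebuild the 5xN downmix matrix from DMG and DCLD."""
    gain = 10.0 ** (DMG / 20.0)        # overall gain per input signal, shape (N,)
    w = 10.0 ** (-0.1 * DCLD[1:])      # D^2(j,r) / D^2(1,r) for channels j = 2..5
    denom = 1.0 + w.sum(axis=0)        # total relative power per input signal
    D = np.empty_like(DCLD, dtype=float)
    D[0] = gain * np.sqrt(1.0 / denom)
    D[1:] = gain * np.sqrt(w / denom)
    return D
```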
  • In the transcoding at decoder side, for every frame l and processing band m (cf. the above table), the mixing matrix D_l,m(i,j) and the rendering matrix A_l,m(i,j) used for remixing the L objects have size 5 by N in this embodiment. For every time/frequency tile, the original down-mixed signals Y at encoder side and the desired remixed signals Z at decoder side are:

    $$Y_t(i,k) = D_{l,m}(i,j)\, X_t(j,k)\,, \qquad Z_t(i,k) = A_{l,m}(i,j)\, X_t(j,k)\,, \qquad (6)$$

    where $t \in l$, $k \in m$ and $\{i = 1{:}5,\ j = 1{:}N\}$.
  • For simplicity of notation, the i-j-k indices are dropped, and Y_t,m, Z_t,m and X_t,m are considered to be matrices having a number of columns equal to the number of frequency subbands in each processing band m:

    $$Y_{t,m} = D_{l,m}\, X_{t,m}\,, \qquad Z_{t,m} = A_{l,m}\, X_{t,m}\,. \qquad (7)$$
  • I.e., the calculation of Y is carried out for each temporal slot t in a current frame l but for all these temporal slots the same downmix matrix D is used.
  • Using this notation, at encoder side the covariance matrix C of the input signals is defined for every time/frequency tile (l,m) by:

    $$C_{l,m} = E_l\left[ X_{t,m}\, X_{t,m}^{H} \right]\,, \qquad (8)$$

    where $E_l[X_{t,m}]$ represents the expectation over frame l of the input signals X_t,m and $X_{t,m}^{H}$ represents the Hermitian notation of matrix X_t,m, i.e. its conjugate transpose.
  • Using equations (7) and (8), the correlation of the down-mixed signals at encoder side can be expressed as:

    $$G_{l,m} = E_l\left[ Y_{t,m} Y_{t,m}^{H} \right] = E_l\left[ D_{l,m} X_{t,m} X_{t,m}^{H} D_{l,m}^{H} \right] = D_{l,m}\, E_l\left[ X_{t,m} X_{t,m}^{H} \right] D_{l,m}^{H} = D_{l,m} C_{l,m} D_{l,m}^{H}\,, \qquad (9)$$

    and the correlation of the desired remixed signals at decoder side as:

    $$F_{l,m} = E_l\left[ Z_{t,m} Z_{t,m}^{H} \right] = E_l\left[ A_{l,m} X_{t,m} X_{t,m}^{H} A_{l,m}^{H} \right] = A_{l,m}\, E_l\left[ X_{t,m} X_{t,m}^{H} \right] A_{l,m}^{H} = A_{l,m} C_{l,m} A_{l,m}^{H}\,. \qquad (10)$$
  • However, because there is no access at decoder side to the original input signals X in order to remix them using the rendering matrix A, the desired remix signals Z_t,m are approximated in step/stage 32 as remix signals Ẑ_t,m from the down-mixed signals Y:

    $$\hat{Z}_{t,m} = T_{l,m}\, Y_{t,m}\,, \qquad (11)$$

    with $t \in l$.
  • The estimation matrix T_l,m should be chosen such that the squared error is minimised:

    $$T_{l,m} = \operatorname{argmin}\, E_l\left[ \left( Z_{t,m} - \hat{Z}_{t,m} \right) \left( Z_{t,m} - \hat{Z}_{t,m} \right)^{H} \right]\,. \qquad (12)$$

    Remark: Z_t,m is not available at decoder side but is used for the derivation of the following equations, and it finally turns out that knowledge of Z_t,m at decoder side is not required.
  • Using the 'Orthogonality Principle', it is known that the squared error is minimised when the error is orthogonal to the space spanned by the original down-mix signals. This means that:

    $$\begin{aligned}
    E_l\left[ \left( Z_{t,m} - \hat{Z}_{t,m} \right) Y_{t,m}^{H} \right] &= 0\\
    E_l\left[ Z_{t,m} Y_{t,m}^{H} \right] &= E_l\left[ \hat{Z}_{t,m} Y_{t,m}^{H} \right]\\
    E_l\left[ A_{l,m} X_{t,m} X_{t,m}^{H} D_{l,m}^{H} \right] &= E_l\left[ T_{l,m} Y_{t,m} Y_{t,m}^{H} \right]\\
    A_{l,m}\, E_l\left[ X_{t,m} X_{t,m}^{H} \right] D_{l,m}^{H} &= T_{l,m} D_{l,m}\, E_l\left[ X_{t,m} X_{t,m}^{H} \right] D_{l,m}^{H}\\
    A_{l,m} C_{l,m} D_{l,m}^{H} &= T_{l,m} D_{l,m} C_{l,m} D_{l,m}^{H}\,. \qquad (13)
    \end{aligned}$$
  • If the matrix $D_{l,m} C_{l,m} D_{l,m}^{H}$ is not singular, the estimation matrix T_l,m can be computed at decoder side by inverting matrix G_l,m:

    $$T_{l,m} = A_{l,m} C_{l,m} D_{l,m}^{H} \left( D_{l,m} C_{l,m} D_{l,m}^{H} \right)^{-1}\,, \qquad (14)$$

    wherein covariance matrix C is estimated at decoder side according to equation (4).
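  • Equation (14) then reads, in the same sketch style, as below; it assumes G is invertible (np.linalg.solve could replace the explicit inverse for better numerical robustness).

```python
import numpy as np

def estimation_matrix(A: np.ndarray, C: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Equation (14): T = A C D^H (D C D^H)^{-1}."""
    DH = D.conj().T
    G = D @ C @ DH                    # correlation of the downmix, equation (9)
    return A @ C @ DH @ np.linalg.inv(G)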
  • Because the expression of G_l,m depends only on parameters known at encoder side and does not depend on the rendering matrix, it can be decided before encoding whether or not remixing is feasible at decoder side. Practice, and the special form of the down-mixing and correlation matrices of the ambience channels, show that in most cases matrix G_l,m will be invertible.
  • In order to ensure the functionality of the decoding processing, the rank of this matrix G_l,m is computed in step/stage 25 before final encoding, and one can proceed with the encoding of the side information if that matrix has full rank. The rank of a matrix is the number of independent columns or rows, and it can be hard to determine numerically. Thus, instead of the rank itself, the effective rank is used, which is a more stable measure and is described in section 6.3 "Singular Value Decomposition" of the textbook of Gilbert Strang, "Linear Algebra and Its Applications", 4th edition, published 19 July 2005. First, the eigenvalues of matrix G_l,m are computed. Next, the number of eigenvalues greater than a tolerance value τ is counted, and this number is taken as the effective rank. Practice shows that τ = 10⁻⁵ can be chosen (a preferential range for τ is 10⁻⁴ ... 10⁻⁸), and for matrices with full effective rank this is assumed to be accurate enough for the inverse computation.
  • The rank value can be used for controlling the number K of frequency bands applied in the inventive processing, and thereby the accuracy of the side information parameters. The rank value can also be used for switching on or off a residual coding like in the MPEG SAOC standard.
  • Measuring the sensitivity entirely by the smallest singular value may not be sufficient, because a simple multiplication of the matrix by a factor of 10⁵ will indicate a much less singular matrix, whereas an ill-conditioning problem is not solved by a simple re-scaling. Thus, before computing the effective rank as described above, a normalisation of matrix G_l,m is carried out.
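  • A sketch of the normalised effective-rank test follows. Normalising by the largest eigenvalue magnitude is one plausible reading of the normalisation mentioned above, so treat it as an assumption rather than the normative procedure.

```python
import numpy as np

def effective_rank(G: np.ndarray, tol: float = 1e-5) -> int:
    """Count eigenvalues of the normalised (symmetric) G above the tolerance."""
    eig = np.linalg.eigvalsh(G)                         # G = D C D^H is Hermitian
    scale = max(np.abs(eig).max(), np.finfo(float).tiny)
    return int((eig / scale > tol).sum())
```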
  • Because C_l,m represents a correlation matrix of the input audio signals, it is a symmetric matrix, and consequently $G_{l,m} = D_{l,m} C_{l,m} D_{l,m}^{H}$ also forms a symmetric matrix. Therefore, in order to compute the eigenvalues, a Schur decomposition for a symmetric matrix is used:

    $$G_{l,m} = Q_{l,m} U_{l,m} Q_{l,m}^{-1}\,, \qquad (15)$$

    where Q_l,m is a unitary matrix ($Q_{l,m}^{-1} = Q_{l,m}^{H}$) and U_l,m is a diagonal matrix having on its main diagonal the eigenvalues of G_l,m: $\operatorname{diag}(U_{l,m}) = \{\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5\}$.
  • An interesting property of the Schur decomposition is that the inverse of G_l,m can easily be computed as:

    $$G_{l,m}^{-1} = Q_{l,m} \bar{U}_{l,m} Q_{l,m}^{-1}\,, \qquad (16)$$

    where $\bar{U}_{l,m}$ is a diagonal matrix having on its main diagonal the inverses of the eigenvalues of matrix G_l,m: $\operatorname{diag}(\bar{U}_{l,m}) = \{1/\lambda_1, 1/\lambda_2, 1/\lambda_3, 1/\lambda_4, 1/\lambda_5\}$. Proof: matrix U_l,m is a diagonal matrix and, if the values on the main diagonal are different from zero, the matrix is invertible, whereby the inverse is equal to $\bar{U}_{l,m}$. Thus the processing starts with the following computation:

    $$I_5 = Q_{l,m} Q_{l,m}^{-1} = Q_{l,m} I_5\, Q_{l,m}^{-1} = Q_{l,m} U_{l,m} \bar{U}_{l,m} Q_{l,m}^{-1} = \left( Q_{l,m} U_{l,m} Q_{l,m}^{-1} \right) \left( Q_{l,m} \bar{U}_{l,m} Q_{l,m}^{-1} \right) = G_{l,m} \left( Q_{l,m} \bar{U}_{l,m} Q_{l,m}^{-1} \right)\,. \qquad (17)$$

    Thus G_l,m is invertible, with the inverse being equal to $G_{l,m}^{-1} = Q_{l,m} \bar{U}_{l,m} Q_{l,m}^{-1}$.
  • A singular matrix G_l,m is not very common in practice, and it can be made non-singular by a small weighting of one coefficient of the singular matrix. Thus, in order to ensure at decoder side that matrix G_l,m is invertible, after computing the Schur decomposition and finding eigenvalues smaller than the defined tolerance value τ, these eigenvalues are modified by adding a weight of τ to each of them. In this way, when computing the inverse of G_l,m using the described property of the Schur decomposition, it is guaranteed that this matrix is well-conditioned. The error introduced by this procedure is of order τ and will not affect the remixing processing in step/stage 32.
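  • Because G is symmetric, its Schur decomposition coincides with its eigendecomposition, so the τ-weighting of small eigenvalues and the inverse of equation (16) can be sketched together. Here np.linalg.eigh stands in for the Schur routine, an equivalence that holds only for symmetric/Hermitian matrices.

```python
import numpy as np

def regularised_inverse(G: np.ndarray, tol: float = 1e-5) -> np.ndarray:
    """Invert G = Q U Q^H after adding tol to eigenvalues below the tolerance."""
    lam, Q = np.linalg.eigh(G)                 # for symmetric G: Schur == eigendecomposition
    lam = np.where(lam < tol, lam + tol, lam)  # the small tau weighting described above
    return (Q / lam) @ Q.conj().T              # Q diag(1/lam) Q^H, cf. equation (16)
```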
  • The values used for Cl,m and Dl,m are estimated according to equations (4) and (5).

Claims (12)

  1. Method for encoding by downmixing (22) spatial audio signals (Ch.#1 - Ch.#5, Obj.#1 - Obj.#L) that can be downmixed at receiver side in a manner different from the manner of downmixing at encoder side, wherein said encoding is based on MPEG SAOC and said downmixing at receiver side can be controlled (35) by desired playback configuration data and/or desired object positioning data, said method including the steps:
    - processing M correlated sound signals (Ch.#1 - Ch.#5), M being greater than '2', and L independent sound signals (Obj.#1 - Obj.#L), L being '1' or greater, in an analysis filter bank (21) providing corresponding time/frequency domain signals (X);
    - multiplying (22) said time/frequency domain signals with a downmix matrix Dl,m, followed by processing the resulting signals (Y) in a synthesis filter bank (24) that has an inverse operation of said analysis filter bank (21) and that provides M time domain output signals;
    - determining (23) from said time/frequency domain signals (X) MPEG SAOC side information data including Object Level Differences data OLD and Inter-Object Coherence data IOC, as well as enhanced Downmix Gains data DMG and Downmix Channel Level Differences data DCLD, wherein said DMG and DCLD data are related to M channels.
  2. Method for downmixing (32) spatial audio signals processed according to claim 1 in a manner different from the manner of downmixing at encoder side, wherein said downmixing at receiver side can be controlled (35) by desired playback configuration data and/or desired object positioning data, said method including the steps:
    - receiving said processed spatial audio signals and processing them in an analysis filter bank (31), providing corresponding time/frequency domain signals (Y);
    - determining (35) from said desired playback configuration data and/or said desired object positioning data a rendering matrix Al,m ;
    - determining (34) from the received OLD, IOC, DMG and DCLD data an estimated covariance matrix Cl,m and a reconstructed down-mixing matrix Dl,m ;
    - calculating (36) an estimation matrix $T_{l,m} = A_{l,m} C_{l,m} D_{l,m}^{H} \left( D_{l,m} C_{l,m} D_{l,m}^{H} \right)^{-1}$;
    - multiplying (32) said time/frequency domain signals (Y) with said estimation matrix Tl,m so as to get desired-remix signals (Ẑ), followed by processing said desired-remix signals in a synthesis filter bank (33) that has an inverse operation of said analysis filter bank (31).
  3. Apparatus for encoding by downmixing (22) spatial audio signals (Ch.#1 - Ch.#5, Obj.#1 - Obj.#L) that can be downmixed at receiver side in a manner different from the manner of downmixing at encoder side, wherein said encoding is based on MPEG SAOC and said downmixing at receiver side can be controlled (35) by desired playback configuration data and/or desired object positioning data, said apparatus including:
    - an analysis filter bank (21) for processing M correlated sound signals (Ch.#1 - Ch.#5), M being greater than '2', and L independent sound signals (Obj.#1 - Obj.#L), L being '1' or greater, providing corresponding time/frequency domain signals (X);
    - means (22) being adapted for multiplying said time/frequency domain signals (X) with a downmix matrix Dl,m ;
    - a synthesis filter bank (24) for said multiplied time/frequency domain signals (Y) that has an inverse operation of said analysis filter bank (21) and that provides M time domain output signals;
    - means (23) being adapted for determining from said time/frequency domain signals (X) MPEG SAOC side information data including Object Level Differences data OLD and Inter-Object Coherence data IOC, as well as enhanced Downmix Gains data DMG and Downmix Channel Level Differences data DCLD, wherein said DMG and DCLD data are related to M channels.
  4. Apparatus for downmixing (32) spatial audio signals processed according to claim 1 in a manner different from the manner of downmixing at encoder side, wherein said downmixing at receiver side can be controlled (35) by desired playback configuration data and/or desired object positioning data, said apparatus including:
    - means (31) being adapted for receiving said processed spatial audio signals and for processing them in an analysis filter bank, providing corresponding time/frequency domain signals (Y);
    - means (35) being adapted for determining from said desired playback configuration data and/or said desired object positioning data a rendering matrix Al,m ;
    - means (34) being adapted for determining from the received OLD, IOC, DMG and DCLD data an estimated covariance matrix Cl,m and a reconstructed down-mixing matrix Dl,m ;
    - means (36) being adapted for calculating an estimation matrix $T^{l,m} = A^{l,m}\,C^{l,m}\,(D^{l,m})^{H}\,\big(D^{l,m}\,C^{l,m}\,(D^{l,m})^{H}\big)^{-1}$;
    - means (32) being adapted for multiplying said time/frequency domain signals (Y) with said estimation matrix Tl,m so as to get desired-remix signals, followed by processing said desired-remix signals in a synthesis filter bank (33) that performs the inverse operation of said analysis filter bank (31).
  5. Method according to the method of claim 1 or 2, or apparatus according to the apparatus of claim 3 or 4, wherein said spatial input signals are processed in non-overlapping multiple-sample temporal slots, a fixed number of such temporal slots representing a frame l, and in K frequency processing bands into which the total frequency range is divided, K having a value of '2', '3' or '4'.
  6. Method according to the method of claim 5, or apparatus according to the apparatus of claim 5, wherein said Downmix Gains data DMG and Downmix Channel Level Differences data DCLD are calculated for every input signal frame l and processing band m according to:
    $$\mathrm{DMG}^{l,m}(r) = 10\,\log_{10}\!\left(\sum_{i=1}^{5}\big(D^{l,m}(i,r)\big)^{2} + \varepsilon\right)$$
    $$\mathrm{DCLD}^{l,m}(r,j) = 10\,\log_{10}\!\left(\frac{\big(D^{l,m}(1,r)\big)^{2} + \varepsilon}{\big(D^{l,m}(j,r)\big)^{2} + \varepsilon}\right),$$
    where r=1:(M+L) represents a spatial audio input signal index, j=1:5 represents a down-mix channel index, and the value ε is used for avoiding a division by zero in related computations (see the encoder-side sketch following the claims).
  7. Method according to the method of claim 5 or 6, or apparatus according to the apparatus of claim 5 or 6, wherein said rendering matrix Al,m has the size of said downmix matrix Dl,m and its coefficients are based on the coefficients of matrix Dl,m, wherein each sound object is represented by a different row in matrix Al,m.
  8. Method according to the method of one of claims 5 to 7, or apparatus according to the apparatus of one of claims 5 to 7, wherein said estimated covariance matrix Cl,m is calculated according to
    $$C^{l,m}(i,j) = \sqrt{\mathrm{OLD}^{l,m}(i)\,\mathrm{OLD}^{l,m}(j)}\;\mathrm{IOC}^{l,m}(i,j),$$
    and said reconstructed down-mixing matrix Dl,m is calculated according to
    $$D^{l,m}(1,r) = 10^{\frac{\mathrm{DMG}^{l,m}(r)}{20}}\sqrt{\frac{1}{1 + \sum_{j=2}^{5} 10^{-0.1\,\mathrm{DCLD}^{l,m}(r,j)}}}$$
    $$D^{l,m}(i,k) = 10^{\frac{\mathrm{DMG}^{l,m}(k)}{20}}\sqrt{\frac{10^{-0.1\,\mathrm{DCLD}^{l,m}(r,i)}}{1 + \sum_{j=2}^{5} 10^{-0.1\,\mathrm{DCLD}^{l,m}(r,j)}}},$$
    where r=1:(M+L) represents a spatial audio input signal index, i=2:5 represents a down-mix channel index, and k is a running frequency sample index within a current frequency processing band (the decoder-side sketch following the claims illustrates these reconstruction steps).
  9. Method according to the method of one of claims 5 to 8, or apparatus according to the apparatus of one of claims 5 to 8, wherein from said OLD, IOC, DMG and DCLD data a matrix
    $$G^{l,m} = D^{l,m}\,C^{l,m}\,(D^{l,m})^{H}$$
    is calculated (25), in which Cl,m is a related covariance matrix and $(D^{l,m})^{H}$ denotes the Hermitian transpose of downmix matrix Dl,m, and wherein therefrom a rank value is calculated and it is determined whether matrix Gl,m has full rank, and, if true, said side information data is encoded for transmission in said bitstream (see the rank-test sketch following the claims).
  10. Method according to the method of claim 9, or apparatus according to the apparatus of claim 9, wherein said rank value is used to control the number K of frequency bands used in the processing.
  11. Method according to the method of claim 9 or 10, or apparatus according to the apparatus of claim 9 or 10, wherein said rank value is used to switch on or off a residual coding.
  12. Digital audio signal that is encoded according to the method of one of claims 1, 5 and 6.
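
A minimal sketch of the claim 1/claim 3 processing chain (analysis filter bank, multiplication by the downmix matrix D, synthesis filter bank). This is an editorial illustration, not the patented implementation: an ordinary STFT stands in for the SAOC-style filter bank, and the signal count, frame size and the random downmix matrix are invented assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

M, L, fs = 5, 2, 48000
signals = np.random.randn(M + L, fs)          # Ch.#1-Ch.#5 plus L objects, 1 s

# analysis filter bank: X has shape (M+L, frequency bins, temporal slots)
f, t, X = stft(signals, fs=fs, nperseg=1024)
D = np.random.rand(M, M + L)                  # downmix matrix (single band)
Y = np.einsum('cr,rft->cft', D, X)            # Y = D X per time/frequency tile
_, downmix = istft(Y, fs=fs, nperseg=1024)    # synthesis: M time-domain outputs
```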
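A sketch of the encoder-side DMG/DCLD computation of claim 6, assuming a 5 x (M+L) downmix matrix per frame and processing band; the epsilon value and array layout are assumptions.

```python
import numpy as np

EPS = 1e-9  # the epsilon of claim 6, guarding the logarithm and the division

def dmg_dcld(D):
    """D: 5 x (M+L) downmix matrix for one frame l and processing band m."""
    P = D ** 2                                        # per-coefficient power
    dmg = 10.0 * np.log10(P.sum(axis=0) + EPS)        # sum over 5 channels
    dcld = 10.0 * np.log10((P[0] + EPS) / (P + EPS))  # channel 1 vs channel j
    return dmg, dcld                                  # shapes (M+L,), (5, M+L)
```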
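A decoder-side sketch of claims 2, 4 and 8: rebuilding the covariance matrix Cl,m from OLD/IOC, reconstructing the downmix matrix Dl,m from DMG/DCLD, and forming the estimation matrix Tl,m. The square roots follow the usual SAOC convention for converting dB-domain power ratios into amplitude gains and are an interpretation where the source formulas were garbled.

```python
import numpy as np

def covariance(old, ioc):
    """C(i,j) = sqrt(OLD(i) OLD(j)) IOC(i,j), claim 8."""
    return np.sqrt(np.outer(old, old)) * ioc

def reconstruct_downmix(dmg, dcld):
    """Rebuild the 5 x (M+L) matrix D from DMG/DCLD, claim 8."""
    ratios = 10.0 ** (-0.1 * dcld)            # power ratios against channel 1
    denom = 1.0 + ratios[1:5].sum(axis=0)     # per-signal normalisation term
    gain = 10.0 ** (dmg / 20.0)               # overall amplitude gain
    return gain * np.sqrt(ratios / denom)     # row 0 uses ratios[0] == 1

def estimation_matrix(A, C, D):
    """T = A C D^H (D C D^H)^(-1), claims 2 and 4."""
    DH = D.conj().T
    return A @ C @ DH @ np.linalg.inv(D @ C @ DH)
```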
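Finally, the full-rank test of claim 9, which decides whether the side information is encoded for transmission; numpy.linalg.matrix_rank computes the rank via a singular value decomposition, consistent with the Strang reference listed under the non-patent citations.

```python
import numpy as np

def g_has_full_rank(D, C):
    """G = D C D^H must be invertible before the side information is sent."""
    G = D @ C @ D.conj().T
    return np.linalg.matrix_rank(G) == G.shape[0]
```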
EP12305914.9A 2012-07-26 2012-07-26 Method and Apparatus for downmixing MPEG SAOC-like encoded audio signals at receiver side in a manner different from the manner of downmixing at encoder side Withdrawn EP2690621A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP12305914.9A EP2690621A1 (en) 2012-07-26 2012-07-26 Method and Apparatus for downmixing MPEG SAOC-like encoded audio signals at receiver side in a manner different from the manner of downmixing at encoder side

Publications (1)

Publication Number Publication Date
EP2690621A1 true EP2690621A1 (en) 2014-01-29

Family

ID=47002786

Country Status (1)

Country Link
EP (1) EP2690621A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080049943A1 (en) * 2006-05-04 2008-02-28 Lg Electronics, Inc. Enhancing Audio with Remix Capability
US20110166867A1 (en) * 2008-07-16 2011-07-07 Electronics And Telecommunications Research Institute Multi-object audio encoding and decoding apparatus supporting post down-mix signal
US20120078642A1 (en) * 2009-06-10 2012-03-29 Jeong Il Seo Encoding method and encoding device, decoding method and decoding device and transcoding method and transcoder for multi-object audio signals
US20120177204A1 (en) * 2009-06-24 2012-07-12 Oliver Hellmuth Audio Signal Decoder, Method for Decoding an Audio Signal and Computer Program Using Cascaded Audio Object Processing Stages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GILBERT STRANG: "Singular Value Decomposition", in: "Linear Algebra and Its Applications", 19 July 2005

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11769514B2 (en) 2013-04-03 2023-09-26 Dolby Laboratories Licensing Corporation Methods and systems for rendering object based audio
US10276172B2 (en) 2013-04-03 2019-04-30 Dolby Laboratories Licensing Corporation Methods and systems for generating and interactively rendering object based audio
US11270713B2 (en) 2013-04-03 2022-03-08 Dolby Laboratories Licensing Corporation Methods and systems for rendering object based audio
US9805727B2 (en) 2013-04-03 2017-10-31 Dolby Laboratories Licensing Corporation Methods and systems for generating and interactively rendering object based audio
US10832690B2 (en) 2013-04-03 2020-11-10 Dolby Laboratories Licensing Corporation Methods and systems for rendering object based audio
US10553225B2 (en) 2013-04-03 2020-02-04 Dolby Laboratories Licensing Corporation Methods and systems for rendering object based audio
US10002616B2 (en) 2013-10-17 2018-06-19 Socionext Inc. Audio decoding device
WO2015056383A1 * 2013-10-17 2015-04-23 Panasonic Corporation Audio encoding device and audio decoding device
US9779740B2 (en) 2013-10-17 2017-10-03 Socionext Inc. Audio encoding device and audio decoding device
CN109036441B * 2014-03-24 2023-06-06 Dolby International AB Method and apparatus for applying dynamic range compression to Higher Order Ambisonics signals
US11838738B2 (en) 2014-03-24 2023-12-05 Dolby Laboratories Licensing Corporation Method and device for applying Dynamic Range Compression to a Higher Order Ambisonics signal
CN109036441A * 2014-03-24 2018-12-18 Dolby International AB Method and apparatus for applying dynamic range compression to a Higher Order Ambisonics signal
CN106796804A * 2014-10-02 2017-05-31 Dolby International AB Decoding method and decoder for dialog enhancement
CN106796804B * 2014-10-02 2020-09-18 Dolby International AB Decoding method and decoder for dialog enhancement
US10136240B2 (en) 2015-04-20 2018-11-20 Dolby Laboratories Licensing Corporation Processing audio data to compensate for partial hearing loss or an adverse hearing environment
CN108632048A * 2017-03-22 2018-10-09 Spreadtrum Communications (Shanghai) Co., Ltd. Conference call control method and device and multi-pass terminal
CN108632048B * 2017-03-22 2020-12-22 Spreadtrum Communications (Shanghai) Co., Ltd. Conference call control method and device and multi-pass terminal
US11462224B2 (en) 2018-05-31 2022-10-04 Huawei Technologies Co., Ltd. Stereo signal encoding method and apparatus using a residual signal encoding parameter
US11978463B2 (en) 2018-05-31 2024-05-07 Huawei Technologies Co., Ltd. Stereo signal encoding method and apparatus using a residual signal encoding parameter
WO2019227991A1 * 2018-05-31 2019-12-05 Huawei Technologies Co., Ltd. Method and apparatus for encoding stereophonic signal
CN113678199A * 2019-03-28 2021-11-19 Nokia Technologies Oy Determination of the importance of spatial audio parameters and associated coding
CN110739000A * 2019-10-14 2020-01-31 Wuhan University Audio object coding method suitable for personalized interactive system

Similar Documents

Publication Publication Date Title
EP2690621A1 (en) Method and Apparatus for downmixing MPEG SAOC-like encoded audio signals at receiver side in a manner different from the manner of downmixing at encoder side
US9257128B2 (en) Apparatus and method for coding and decoding multi object audio signal with multi channel
US9578435B2 (en) Apparatus and method for enhanced spatial audio object coding
JP5189979B2 (en) Control of spatial audio coding parameters as a function of auditory events
JP5592974B2 (en) Enhanced coding and parameter representation in multi-channel downmixed object coding
EP2941771B1 (en) Decoder, encoder and method for informed loudness estimation employing by-pass audio object signals in object-based audio coding systems
US8867753B2 (en) Apparatus, method and computer program for upmixing a downmix audio signal
US8515759B2 (en) Apparatus and method for synthesizing an output signal
CN105518775B (en) Artifact cancellation for multi-channel downmix comb filters using adaptive phase alignment
KR101798117B1 (en) Encoder, decoder and methods for backward compatible multi-resolution spatial-audio-object-coding
US11501785B2 (en) Method and apparatus for adaptive control of decorrelation filters
MX2012005781A (en) Apparatus for providing an upmix signal represen.
EP2439736A1 (en) Down-mixing device, encoder, and method therefor
WO2023172865A1 (en) Methods, apparatus and systems for directional audio coding-spatial reconstruction audio processing
RU2485605C2 Improved method for coding and parametric representation of multichannel object coding after downmixing

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20140730