WO2020177981A1 - Spatialized audio coding with interpolation and quantization of rotations
- Publication number
- WO2020177981A1 (application PCT/EP2020/053264)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- matrix
- frame
- channels
- current frame
- rotation matrix
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/002—Dynamic bit allocation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/06—Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
Definitions
- the present invention relates to the coding / decoding of spatialized sound data, in particular in a surround sound context (hereinafter also referred to as “ambisonic”).
- the coders / decoders (hereinafter called “codecs”) which are currently used in mobile telephony are mono (a single signal channel for reproduction on a single loudspeaker).
- the 3GPP EVS (“Enhanced Voice Services”) codec offers “Super-HD” quality (also called “High Definition+” or HD+ voice) with a super-wideband (SWB) audio band for signals sampled at 32 or 48 kHz, or a fullband (FB) audio band for signals sampled at 48 kHz; the audio bandwidth is 14.4 to 16 kHz in SWB mode (9.6 to 128 kbit/s) and 20 kHz in FB mode (16.4 to 128 kbit/s).
- the next quality development in conversational services offered by operators should be immersive services, using terminals such as smartphones, for example, equipped with several microphones or spatialized audio conferencing or tele-presence type videoconferencing equipment.
- Object-based format, where sound objects are described as an audio signal (generally mono) associated with metadata describing the attributes of this object (position in space, spatial width of the source, etc.), and
- Ambisonic (scene-based) format, which describes the sound field at a given point, generally picked up by a spherical microphone or synthesized in the spherical harmonics domain.
- the coding of a sound in ambisonic format is considered below by way of example of an embodiment (at least certain aspects presented in connection with the invention below can also be applied to formats other than ambisonics).
- Ambisonics is a recording method ("encoding” in the acoustic sense) of spatialized sound and a reproduction system (“decoding” in the acoustic sense).
- An ambisonic microphone (at order 1) comprises at least four capsules (typically of the cardioid or sub-cardioid type) arranged on a spherical grid, for example at the vertices of a regular tetrahedron.
- the audio channels associated with these capsules are called “A-format”. This format is converted into a “B-format”, in which the sound field is broken down into four components (spherical harmonics) denoted W, X, Y, Z, which correspond to four coincident virtual microphones.
- the W component corresponds to an omnidirectional capture of the sound field while the X, Y and Z components, which are more directive, are comparable to pressure gradients oriented along the three dimensions of space.
- An ambisonic system is a flexible system in the sense that recording and playback are separate and decoupled. It allows decoding (in the acoustic sense) on any speaker configuration (for example, binaural, 5.1-type surround sound or 7.1.4-type periphery (with elevation)).
- the ambisonic approach can be generalized to more than four channels in B-format and this generalized representation is commonly called “HOA” (for “Higher-Order Ambisonics”).
- FOA First-Order Ambisonics
- the first order ambisonics (4 channels: W, X, Y, Z) and the first order planar ambisonics (3 channels: W, X, Y) are hereinafter referred to as “ambisonics” indiscriminately to facilitate reading, the treatments presented being applicable regardless of planar type or not. If, however, in some passages it is necessary to make a distinction, the terms “first-order ambisonics” and “first-order planar ambisonics” are used.
- a stereo signal (2 channels) corresponding to coincident stereo pickups of the Blumlein crossed-pair type (X + Y and X - Y) or of the Mid-Side type (by combining W and X for Mid and taking Y as Side).
- a B-format signal with a predetermined order is called “ambisonic sound”.
- the ambisonic sound can be defined in another format such as A-format or pre-combined channels by fixed matrixing (keeping the number of channels or reducing it to a 3 or 2 channel case), as will be seen.
- the signals to be processed by the encoder / decoder are presented as successions of blocks of sound samples called “frames” or “sub-frames” below.
- mathematical notations follow the following convention:
- a matrix: A (uppercase, bold)
- the simplest approach to encoding a stereo or ambisonic signal is to use a mono encoder and apply it in parallel to all channels with possibly a different bit allocation depending on the channel. This approach is called here "multi-mono" (although in practice we can generalize the approach to multi-stereo or the use of several parallel instances of the same core codec).
- One such embodiment is shown in Figure 1.
- the input signal is divided into (mono) channels by block 100. These channels are individually coded by blocks 120 through 122 according to a predetermined allocation. Their bitstreams are multiplexed (block 130) and, after transmission and/or storage, demultiplexed (block 140) so that each channel can be decoded (blocks 150 to 152) before the channels are recombined (block 160).
- the solutions currently proposed for more sophisticated codecs, for ambisonic spatialization in particular, are not satisfactory, in particular in terms of complexity, delay and efficient use of the bit rate, to ensure efficient decorrelation between ambisonic channels.
- the MPEG-H codec for ambisonic sounds uses an overlap-add operation which adds delay and complexity, as well as a linear interpolation on direction vectors which is suboptimal and introduces defects.
- a basic problem with this codec is that it implements a decomposition into predominant components and ambience because the predominant components are supposed to be perceptually distinct from ambience, but this decomposition is not fully specified.
- the MPEG-H encoder suffers from the problem of non-correspondence between the directions of the principal components from one frame to another: the order of the components (signals) can be swapped just like the associated directions. This is the reason why the MPEG-H codec uses a “matching” and overlap-add technique in order to solve this problem.
- the present invention makes it possible to improve a decorrelation between the N channels to be encoded separately subsequently.
- This separate encoding is hereinafter also referred to as “multi-mono encoding”.
- the method may further include:
- the method may further include:
- the method further comprises:
- the method may further comprise:
- a number of interpolations to be made between the rotation matrix of the current frame and the rotation matrix of the previous frame is determined, - the current frame is divided into a number of sub-frames corresponding to the number of interpolations to be made, and
- each interpolation for a current sub-frame is a spherical linear interpolation (or “SLERP”), carried out as a function of the interpolation of the sub-frame preceding the current sub-frame and from the quaternions of the previous sub-frame.
- $Q_{R,t-1}$ is the other of the quaternions of the previous subframe t-1,
- $Q_{R,t}$ is the other of the quaternions of the current subframe t
- the search for the eigenvectors is performed by principal component analysis (or "PCA”) or by Karhunen Loeve transform (or "KLT”), in the time domain.
- the method includes a preliminary step of forecasting the budget for allocation of bits per ambisonic channel, comprising:
- the present invention also relates to a method of decoding sound signals forming a succession in time of frames of samples, in each of N channels in ambisonic representation of order greater than 0, the method comprising:
- the present invention is also aimed at a coding device comprising a processing circuit for implementing the coding method presented above. It also relates to a decoding device comprising a processing circuit for implementing the above decoding method. It also relates to a computer program comprising instructions for the implementation of the above method, when these instructions are executed by a processor of a processing circuit.
- FIG. 3 shows the general structure of an example of an encoder according to the invention
- FIG. 4 shows details of the analysis and the PCA / KLT transformation carried out by block 310 of the encoder of Figure 3,
- FIG. 5 shows an example of a decoder according to the invention
- FIG. 7 illustrates examples of structural embodiments of an encoder and of a decoder within the meaning of the invention.
- the invention aims to allow an optimized coding by:
- Adaptive matrixing allows more efficient channelization than fixed matrixing.
- the matrixing according to the invention advantageously makes it possible to decorrelate the channels before multi-mono coding, so that the coding noise introduced by the coding of each of the channels distorts the overall spatial image as little as possible when the channels are recombined to reconstruct an ambisonic signal on decoding.
- the invention makes it possible to ensure a smooth adaptation of the matrixing parameters in order to avoid “click”-type artefacts at frame boundaries, overly rapid fluctuations of the spatial image, or coding artefacts due to excessive variations (for example linked to untimely permutations of sound sources between channels) in the individual channels resulting from the matrixing, which are then coded by different instances of a mono codec.
- Multi-mono coding is presented below with preferentially variable allocation of bits between channels (after adaptive matrixing), but in variants several instances of a stereo or other core codec can be used.
- certain explanatory concepts concerning rotations in dimension n and decompositions of the PCA / KLT or SVD type (“SVD” denoting a singular value decomposition) are recalled below.
- the invention uses a representation of the rotations in dimension n with parameters suitable for a quantization per frame and above all for an efficient interpolation per subframe.
- the representations of rotations used in dimension 2, 3 and 4 are defined below.
- a rotation (about the origin) is a transformation of n-dimensional space that maps one vector to another vector through a matrix $M$ such that $M^T M = I_n$, where $I_n$ designates the identity matrix of size $n \times n$ (i.e. $M$ is a unitary matrix, $M^T$ designating the transpose of $M$) and its determinant is +1.
- a rotation matrix of size 3 × 3 can be decomposed into a product of 3 elementary rotations of angle $\theta$ about the x, y, or z axes.
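The decomposition into elementary rotations can be illustrated with a short sketch. The composition order (z, then y, then x applied from the right) and the helper names `rot_x`, `rot_y`, `rot_z` and `compose` are assumptions made for this illustration; the text does not fix a convention.

```python
import math

def rot_x(a):
    """Elementary rotation of angle a about the x axis."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    """Elementary rotation of angle a about the y axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(a):
    """Elementary rotation of angle a about the z axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def det3(m):
    """Determinant of a 3 x 3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def compose(ax, ay, az):
    """Assumed convention: M = Rz(az) * Ry(ay) * Rx(ax)."""
    return matmul(rot_z(az), matmul(rot_y(ay), rot_x(ax)))
```

Any matrix built this way satisfies $M^T M = I_3$ with determinant +1, which is the rotation property stated above.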
- the SLERP interpolation method (for “spherical linear interpolation”) consists of interpolating according to the formula: $\mathrm{slerp}(q_1, q_2, \alpha) = \frac{\sin((1-\alpha)\Omega)}{\sin\Omega}\, q_1 + \frac{\sin(\alpha\Omega)}{\sin\Omega}\, q_2$, where $0 \le \alpha \le 1$ is the interpolation factor to go from $q_1$ to $q_2$ and $\Omega$ is the angle between the two quaternions: $\Omega = \arccos(q_1 \cdot q_2)$, where $q_1 \cdot q_2$ denotes the dot product between two quaternions (identical to the dot product between two vectors of dimension 4). This amounts to interpolating along a great circle on a 4D sphere with a constant angular speed as a function of $\alpha$. The shortest path should be used for the interpolation, by changing the sign of one of the quaternions when $q_1 \cdot q_2 < 0$. Note that other quaternion interpolation methods can be used (normalized linear interpolation or nlerp, splines, etc.).
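As an illustration of the SLERP formula, here is a minimal sketch in plain Python. The component order (w, x, y, z), the function name `slerp` and the fallback to normalized linear interpolation for nearly identical quaternions are assumptions of this example, not details taken from the text.

```python
import math

def slerp(q1, q2, alpha):
    """Spherical linear interpolation between two unit quaternions.

    q1, q2: sequences of 4 floats, assumed to have unit norm.
    alpha:  interpolation factor, 0 <= alpha <= 1.
    """
    dot = sum(a * b for a, b in zip(q1, q2))
    # Take the shortest path: q and -q represent the same rotation.
    if dot < 0.0:
        q2 = [-b for b in q2]
        dot = -dot
    dot = min(1.0, dot)          # guard against rounding slightly above 1
    omega = math.acos(dot)       # angle between the two quaternions
    if omega < 1e-6:
        # Nearly identical quaternions: fall back to normalized lerp (assumption).
        mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(q1, q2)]
        norm = math.sqrt(sum(c * c for c in mixed))
        return [c / norm for c in mixed]
    s = math.sin(omega)
    w1 = math.sin((1.0 - alpha) * omega) / s
    w2 = math.sin(alpha * omega) / s
    return [w1 * a + w2 * b for a, b in zip(q1, q2)]
```

Halfway between the identity quaternion and a 90-degree rotation about one axis, the result is the 45-degree rotation about that axis, and the output keeps unit norm.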
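- Singular value decomposition (or “SVD”)
- Singular value decomposition consists in factoring a real matrix $A$ of size $m \times n$ in the form: $A = U \Sigma V^T$, where $U$ is an orthogonal matrix of size $m \times m$, $\Sigma$ is a diagonal matrix of size $m \times n$ with $p = \min(m, n)$ diagonal coefficients $\sigma_i$, $V$ is an orthogonal matrix of size $n \times n$ and $V^T$ is the transpose of $V$.
- the coefficients $\sigma_i$ on the diagonal of $\Sigma$ are the singular values of the matrix $A$. By convention, they are generally listed in decreasing order, and in this case the diagonal matrix $\Sigma$ associated with $A$ is unique.
- the rank $r$ of $A$ is given by the number of non-zero coefficients $\sigma_i$. The singular value decomposition can therefore be rewritten as: $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$.
- the non-zero eigenvalues of $A^T A$ and $A A^T$ are the $\sigma_i^2$; the columns of $U$ are the eigenvectors of $A A^T$ and the columns of $V$ are the eigenvectors of $A^T A$.
- the SVD can be interpreted geometrically: the image of a sphere in dimension $n$ by the matrix $A$ is, in dimension $m$, a hyper-ellipse with principal axes in the directions $u_1, u_2, \ldots, u_m$ and of lengths $\sigma_1, \ldots, \sigma_m$.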
- Karhunen Loeve transform (or “KLT” for “Karhunen Loeve Transform”)
- ⁇ S is the eigenvector matrix (with the convention that eigenvectors are column vectors) obtained by eigenvalue decomposition of s tt
- $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix whose coefficients are the eigenvalues.
- the matrix $V = [v_1, v_2, \ldots, v_n]$ contains the eigenvectors (columns) of $R_{xx}$, such that
- Principal component analysis is a dimensionality reduction technique which produces orthogonal variables and maximizes the variance of the variables after projection (or, equivalently, minimizes the reconstruction error).
- the PCA is a transformation by the matrix S T which projects the data into a new basis to maximize the variance of the variables after projection.
- the PCA can also be obtained from an SVD of the signal arranged in the form of a matrix of size $n \times N$. In this case, we can write:
- PCA is generally seen as a dimensionality reduction technique, used to “compress” a high-dimensional dataset into a set comprising few principal components.
- the PCA advantageously makes it possible to decorrelate the multidimensional input signal but one avoids eliminating channels (therefore reducing the number of channels) in order to avoid introducing artefacts.
- a minimum encoding rate is thus enforced to avoid “truncating” the spatial image, except in specific variants where eigenvalues are so low that a zero rate can be authorized (for example to better encode ambisonic sounds created artificially with a single synthetically spatialized source).
- FIG. 2 describes the general principles of the steps which are implemented in a method within the meaning of the invention, for a current frame t.
- Step S1 consists in obtaining the respective signals of the ambisonics channels (here four channels W, Y, Z, X in the example described using a channel order according to the ACN convention for Ambisonics Channel Number), for each frame t. These signals can be put in the form of an n x L matrix (for n ambisonic channels (here 4) and L samples per frame).
- the signals from these channels can optionally be pre-processed, for example by a high-pass filter as described below with reference to FIG. 3.
- In step S3, a principal component analysis (PCA) or, equivalently, a Karhunen-Loève transform (KLT) is applied to these signals, to obtain eigenvalues and an eigenvector matrix from a covariance matrix of the n channels.
- an SVD could be used.
- In step S4, this matrix of eigenvectors, obtained for the current frame t, undergoes signed permutations so that it is as closely aligned as possible with the matrix of the same nature of the previous frame t-1.
- Such an embodiment makes it possible to ensure maximum consistency between the two matrices and thus avoid audible clicks between two frames during sound reproduction.
- the determinant of the eigenvector matrix of the current frame t must be positive and equal to (or, in practice, close to) +1 in step S6. If it is equal to (or close to) -1, it is then necessary either to:
- swap two eigenvectors again (for example those associated with low-energy channels, which are therefore not very representative), or
- preferably, invert the sign of all the elements of a column (for example one associated with a low-energy channel) in step S6.
- Parameters of this matrix can then be coded on a number of bits allocated for this purpose in step S8.
- In step S9, when a significant difference (for example greater than a threshold) is observed between the rotation matrix estimated for the current frame t and the rotation matrix of the previous frame t-1, a variable number of interpolation sub-frames can be determined; otherwise, this number of sub-frames is fixed at a predetermined value.
- Step S10 consists of:
- In step S11, the interpolated rotation matrices are applied to a matrix of size n × (L/K) representing each of the K sub-frames of the signals of the ambisonic channels of step S1 (or optionally S2), in order to decorrelate these signals as much as possible before the multi-mono encoding of step S14, in keeping with the general approach recalled above. A bit allocation to the separate channels is made in step S12 and encoded in step S13.
- In step S14, before carrying out the multiplexing of step S15 and thus ending the compression coding method, it is possible to decide on a number of bits to be allocated per channel as a function of the representativeness of this channel and of the bit rate available on the RES network (figure 7).
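The per-sub-frame application of interpolated rotations can be sketched for the simplest case, two channels, where a rotation reduces to a single angle. Everything below is an illustrative assumption: the function name `rotate_subframes`, the linear interpolation of the angle, the schedule in which the last sub-frame reaches the rotation of the current frame, and the choice of applying the transpose of the rotation matrix to the samples.

```python
import math

def rotate_subframes(frame, angle_prev, angle_cur, K):
    """Apply interpolated 2D rotations to K sub-frames of a 2 x L frame.

    frame: list of two channels, each a list of L samples (L divisible by K).
    angle_prev, angle_cur: rotation angles of frames t-1 and t, in radians.
    """
    n_ch, L = len(frame), len(frame[0])
    assert n_ch == 2 and L % K == 0
    sub = L // K
    out = [[0.0] * L for _ in range(2)]
    for k in range(K):
        # Assumed schedule: the last sub-frame uses the current frame's angle.
        alpha = (k + 1) / K
        theta = (1.0 - alpha) * angle_prev + alpha * angle_cur
        c, s = math.cos(theta), math.sin(theta)
        for j in range(k * sub, (k + 1) * sub):
            x0, x1 = frame[0][j], frame[1][j]
            # Transpose of V = [[c, -s], [s, c]] applied to the sample pair.
            out[0][j] = c * x0 + s * x1
            out[1][j] = -s * x0 + c * x1
    return out
```

Because each sub-frame is multiplied by a rotation, the energy of every sample pair is preserved; only its distribution between the two output channels changes.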
- the energy in each channel is estimated for a current frame and this energy is multiplied by a predefined score for this channel and for a given bit rate (this score being for example an MOS score explained below with reference to figure 3).
- the number of bits to be allocated for each channel is thus weighted.
- Such an embodiment is advantageous as such and may possibly be the subject of separate protection in an ambisonic context.
- the DCOD coding device comprises a processing circuit typically including:
- an interface INT1 for receiving ambisonic signals distributed over different channels (for example four channels W, Y, Z, X at order 1) with a view to their compression coding within the meaning of the invention
- a processor PROC1 for receiving these signals and processing them by executing the computer program instructions stored in the memory MEM1, with a view to their coding;
- the DDEC decoding device has its own processing circuit, typically including:
- a memory MEM2 for storing instruction data of a computer program within the meaning of the invention (these instructions can be distributed between the DCOD encoder and the DDEC decoder as indicated above);
- an interface COM2 for receiving the encoded signals from the RES network with a view to their compression decoding within the meaning of the invention
- a processor PROC2 for processing these signals by executing the computer program instructions stored in the memory MEM2, with a view to their decoding
- FIG. 7 illustrates an example of a structural embodiment of a codec (encoder or decoder) within the meaning of the invention.
- FIG. 3 to describe an encoder device within the meaning of the invention.
- the encoder's strategy is to decorrelate the channels of the ambisonic signal as much as possible and to encode them with a core codec. This strategy makes it possible to limit the artefacts in the decoded ambisonic signal.
- the latter can typically be an extension of the standardized 3GPP EVS (for “Enhanced Voice Services”) encoder.
- EVS coding rates can then be used without modifying the structure of the EVS bitstream.
- the multi-mono coding (block 340 of FIG. 3 described below) operates here with a possible allocation to each transformed channel, restricted to the following rates for coding in super-wide audio band: 9.6; 13.2; 16.4; 24.4; 32; 48; 64; 96 and 128 kbit / s.
- the block 300 receives an input signal Y in the current frame of index t.
- the frame index is not indicated here so as not to burden the notation.
- This is a matrix of size n × L.
- W, Y, Z, X (thus defined according to the ACN order), which can be normalized according to the SN3D convention.
- the order of the channels can alternatively be, for example, W, X, Y, Z (following the FuMa convention) and the normalization can be different (N3D or FuMa).
- This is therefore a succession of samples from 1 to L occupying frame t.
- the block 300 of the encoder applies a pre-processing (optional) to obtain the pre-processed input signal denoted Y.
- the pre-processing may be a high-pass filtering (with a cutoff frequency typically at 20 Hz) of each new 20 ms frame of the input signal channels. This operation removes the DC component, which is likely to bias the estimation of the covariance matrix, so that at the output of block 300 the signal can be considered to have zero mean.
- $H_{pre}$: $H_{pre}(z)\, Y_i(z)$.
- a Butterworth filter of order 6 with a frequency of 50 Hz can be used, for example.
- the pre-processing could include a fixed matrixing step which could keep the same number of channels or reduce the number of channels.
- An example of matrixing applied to the four channels of an ambisonic signal in B-format is given below:
- this pre-processing will have to be reversed on decoding by applying a matrixing to the decoded signal to recover the channels in the original format.
- the following block 310 estimates at each frame t a transformation matrix obtained by determining the eigenvectors by PCA / KLT and checking that the transformation matrix formed by these eigenvectors indeed characterizes a rotation. Details of the operation of block 310 are given further on with reference to FIG. 4. This transformation matrix performs a matrixing of the channels to decorrelate them, making it possible to apply an independent coding of the multi-mono type by block 340.
- Block 310 transmits to the multiplexer quantization indices representing the transformation matrix and optionally information encoding the number of interpolations of the transformation matrix, per sub-frame of the current frame t, as detailed further below.
- Block 320 determines the optimal rate allocation for each channel (after PCA / KLT transformation) as a function of a given B-bit budget. This block seeks a distribution of the bit rate between channels by calculating a score for each possible combination of bit rates; the optimal allocation is found by looking for the combination maximizing this score.
- the number of possible bit rates for mono encoding of one channel may be limited to the nine discrete bit rates of the EVS codec having super-wide audio band: 9.6; 13.2; 16.4; 24.4; 32; 48; 64; 96 and 128 kbit / s.
- since the codec according to the invention operates at a given bit rate associated with a budget of B bits in the current frame of index t, in general only a subset of these listed bit rates can be used.
- B overhead corresponds to the bit budget for the additional information encoded per frame (binary allocation + rotation data) as described later.
- $B_{multimono} = 50.05$ kbit/s.
- the block 320 can then evaluate all the possible (relevant) combinations of bit rates for the 4 channels resulting from the PCA / KLT transformation (at the output of the block 310) and attribute a score to them. This score is calculated based on:
- MOS for "Mean Opinion Score", being an average score on a panel of testers
- the optimal allocation can be such that:
- the factor $E_i$ can be fixed at the value taken by the eigenvalue associated with channel i resulting from the decomposition into eigenvalues of the signal at the input of block 310 and after possible signed permutation.
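The exhaustive search described above can be sketched as follows. The list of bit rates comes from the text; the MOS values in `MOS_SCORES` are purely hypothetical placeholders (the actual scores are not reproduced here), and the score is assumed to be the sum over channels of the energy factor times the MOS score of the allocated rate.

```python
from itertools import product

# Super-wideband EVS bit rates listed in the text, in kbit/s.
RATES = [9.6, 13.2, 16.4, 24.4, 32.0, 48.0, 64.0, 96.0, 128.0]

# HYPOTHETICAL quality scores, one per rate, for illustration only.
MOS_SCORES = dict(zip(RATES, [3.6, 3.9, 4.1, 4.3, 4.4, 4.5, 4.6, 4.65, 4.7]))

def allocate(energies, budget_kbps, rates=RATES, mos=MOS_SCORES):
    """Return the rate combination maximizing sum(E_i * Q(b_i)) within budget.

    energies: one energy factor per channel (e.g. the eigenvalues).
    budget_kbps: total rate available for the multi-mono coding.
    """
    best_combo, best_score = None, float("-inf")
    for combo in product(rates, repeat=len(energies)):
        if sum(combo) > budget_kbps:
            continue  # this combination exceeds the bit budget
        score = sum(e * mos[b] for e, b in zip(energies, combo))
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score
```

With four channels there are 9^4 = 6561 combinations, so a brute-force scan is cheap. With a budget of 50.05 kbit/s, only combinations such as (13.2, 13.2, 13.2, 9.6) or (16.4, 13.2, 9.6, 9.6) and their permutations use most of the budget, and the channel with the highest energy receives the highest rate.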
- the subjective (average) MOS scores of an EVS standardized encoder given by:
- MOS score values for each of the listed bit rates can be derived from other tests (subjective or objective) predicting the quality of the codec. It is also possible to adapt the MOS scores used in the current frame according to a classification of the type of signal (for example a speech signal without background noise, speech with ambient noise, music or mixed content), by reusing classification methods implemented by the EVS codec and by applying them to the W channel of the ambisonic input signal before performing the bit allocation.
- the MOS score can also correspond to an average score resulting from different types of methodologies and rating scales: MOS (absolute) from 1 to 5, DMOS (from 1 to 5), MUSHRA (from 0 to 100).
- the list of bit rates and the scores $Q(b_i)$ can be replaced as a function of this other codec. It is also possible to add additional coding rates to the EVS encoder and therefore complete the list of rates and MOS scores, or even modify the EVS encoder and potentially the associated MOS scores.
- the allocation between the channels is refined by raising the energy to a power $\alpha$, where $\alpha$ takes a value between 0 and 1.
- a second weighting can be added to the score function to penalize inter-frame rate changes.
- a penalty is added to the score if the rate combination is not the same in frame t as in frame t - 1.
- the score is then expressed in the form:
- the combination of the 4 bit rates can be coded in the form of the index: However, we can prefer to enumerate (initially, offline)
- the index can then be represented by a coding of the type “permutation code” + “offset of the combination”. For example, when a 4-bit index codes the 16 bit rate combinations comprising 4 permutations of (13.2, 13.2, 13.2, 9.6) and 12 permutations of (16.4, 13.2, 9.6, 9.6), indices 0-3 can code the first 4 possible permutations (with an offset of 0 and a code ranging from 0 to 3) and indices 4-15 can code the 12 other possible permutations (with an offset of 4 and a code ranging from 0 to 11).
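The “permutation code” + “offset” indexing of the 16 combinations can be sketched as a lookup table. The ordering of the permutations inside each group (here, sorted order) and the names `TABLE`, `OFFSETS`, `encode_combination` and `decode_index` are assumptions of this example; an encoder and decoder only need to agree on the same enumeration.

```python
from itertools import permutations

# The two base combinations given in the text (bit rates in kbit/s).
GROUPS = [(13.2, 13.2, 13.2, 9.6), (16.4, 13.2, 9.6, 9.6)]

def distinct_permutations(values):
    """All distinct orderings of a tuple that may contain repeated values."""
    return sorted(set(permutations(values)))

TABLE = []    # index -> rate combination
OFFSETS = []  # first index of each group
for group in GROUPS:
    OFFSETS.append(len(TABLE))
    TABLE.extend(distinct_permutations(group))

def encode_combination(combo):
    """Return the 4-bit index (offset + permutation code) of a combination."""
    return TABLE.index(tuple(combo))

def decode_index(index):
    """Return the rate combination associated with a 4-bit index."""
    return TABLE[index]
```

The first group yields 4 distinct permutations and the second 12, so the table holds exactly 16 entries and fits a 4-bit index, with offsets 0 and 4 as in the example above.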
- the multi-mono coding block 340 takes as input the matrixed channels coming from block 310 and the bit rates allocated to each channel coming from block 320, and then separately codes the different channels with a core codec, which corresponds to the EVS codec for example. If the core codec used allows stereo or multi-channel coding, the multi-mono approach can be replaced by multi-stereo or multi-channel coding. Once the channels are coded, the associated bitstream is sent to the multiplexer (block 350).
- the multiplexer (block 350) can add zero stuffing bits to reach the bit budget allocated to the current frame, that is, in the variants, the budget of bits remaining can be
- the specified 3GPP EVS encoding algorithm can be modified to introduce additional bit rates. In this case, it is also possible to integrate these additional rates in the table defining the correspondence between b i and Q (b i ).
- a bit can also be reserved in order to be able to switch between two coding modes:
- this matrix can be replaced by the correlation matrix, where the channels are pre-normalized by their respective standard deviation, or generally weights reflecting a relative importance can be applied to each of the channels; moreover, the normalization term 1 / (L - 1) can be omitted or replaced by another value (for example 1 / ⁇ ).
- the values $C_{ij}$ correspond to the covariance between $x_i$ and $x_j$.
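The covariance estimate can be written out as a short sketch, assuming the channels are zero-mean (as expected after the high-pass pre-processing) and using the normalization term 1 / (L - 1) mentioned above. The function name `covariance` and the list-of-rows layout are assumptions of this example.

```python
def covariance(x):
    """Covariance matrix C of an n x L signal matrix with zero-mean rows.

    x: list of n channels, each a list of L samples.
    Returns an n x n matrix with C[i][j] = sum_k x[i][k] * x[j][k] / (L - 1).
    """
    n, L = len(x), len(x[0])
    return [[sum(x[i][k] * x[j][k] for k in range(L)) / (L - 1)
             for j in range(n)]
            for i in range(n)]
```

The diagonal terms are the channel variances and the matrix is symmetric, which is what makes the eigenvalue decomposition of the next step yield real eigenvalues and orthogonal eigenvectors.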
- the encoder then performs in block 410 an eigenvalue decomposition (EVD for “Eigenvalue Decomposition”), by calculating the eigenvalues and the eigenvectors of the matrix C.
- the eigenvectors are denoted here V t to indicate the frame index t because the eigenvectors V t - 1 obtained in the previous frame of index t - 1 are preferably stored and used subsequently.
- the eigenvalues are noted l 1 , l 2 ,..., l n .
- the encoder then applies in block 420 a first signed permutation of the columns of the transformation matrix for frame t (whose columns are the eigenvectors) in order to avoid too much disparity with the transformation matrix of the previous frame t-1, which would generate click problems at the border with the previous frame.
- the eigenvectors of frame t are permuted so that the associated basis is as close as possible to the basis of frame t - 1. This has the effect of improving the continuity of the transformed signal frames (once the transformation matrix is applied to the channels).
- Another constraint is that the transformation matrix must correspond to a rotation.
- the encoder can convert the transformation matrix into generalized Euler angles (block 430) in order to quantize them (block 440) with a predetermined bit budget as seen previously.
- the determinant of this matrix must be positive (equal to +1 typically).
- the optimal signed permutation is obtained in two steps:
- the first step matches the closest vectors between two frames, looking only at the axis and not the direction (sense) of the axis.
- This problem can be formulated as a combinatorial task assignment problem, where the objective is to find the configuration which minimizes a cost.
- the cost can be defined here as the trace of the absolute value of the inter-correlation between the eigenvector matrices of frames t and t - 1.
- The “Hungarian” method (or “Hungarian algorithm”) is used to determine the optimal assignment, which gives a permutation of the eigenvectors of frame $t$;
- the second step (S6 in FIG. 2) consists in determining the direction (sign) of each permuted eigenvector.
- Block 420 calculates the inter-correlation between the permuted eigenvectors of frame $t$ and the eigenvectors of frame $t-1$.
- The transformation matrix at frame $t$ is designated by $V_t$, so that at the following frame the stored matrix becomes $V_{t-1}$.
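The two-step signed permutation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `align_eigenvectors` is hypothetical, matrices are plain lists of rows with eigenvectors stored as columns, and an exhaustive search over permutations stands in for the Hungarian algorithm (for n = 4 there are only 24 candidates). The sketch maximizes the sum of absolute correlations between matched vectors, which is one reading of the cost criterion.

```python
from itertools import permutations


def align_eigenvectors(v_prev, v_cur):
    """Reorder and re-sign the columns of v_cur so they best match v_prev.

    Both matrices are n x n lists of rows; eigenvectors are the columns.
    Returns (aligned matrix, chosen permutation, chosen signs).
    """
    n = len(v_cur)
    # Inter-correlation: c[i][j] = <column i of v_prev, column j of v_cur>.
    c = [[sum(v_prev[k][i] * v_cur[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    # Step 1: match axes only (absolute value ignores the direction).
    # Exhaustive search replaces the Hungarian algorithm in this sketch.
    best = max(permutations(range(n)),
               key=lambda p: sum(abs(c[i][p[i]]) for i in range(n)))
    # Step 2: flip each matched vector whose correlation is negative.
    signs = [1.0 if c[i][best[i]] >= 0 else -1.0 for i in range(n)]
    aligned = [[signs[j] * v_cur[k][best[j]] for j in range(n)]
               for k in range(n)]
    return aligned, best, signs


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    # Same basis, but with columns shuffled and one column negated.
    shuffled = [[0, 0, 1], [-1, 0, 0], [0, 1, 0]]
    print(align_eigenvectors(identity, shuffled))
```

For larger n, a true assignment solver would be preferable to the factorial-time search used here.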
- The search for the optimal signed permutation can also be done by calculating the change-of-basis matrix in the 3D or 4D case and converting it into a unit quaternion or two unit quaternions, respectively. The search then becomes a nearest-neighbor search in a dictionary representing the set of possible signed permutations. For example, in the 4D case the twelve possible even permutations (out of 24 total permutations) of 4 values are associated with the following double unit quaternions written as 4D vectors:
- The search for the optimal (even) permutation can be done by using the above list as a predefined double-quaternion dictionary and performing a nearest-neighbor search against the double quaternion associated with the change-of-basis matrix.
- An advantage of this method is to reuse the quaternion and double quaternion type rotation parameters.
- The transformation matrix resulting from blocks 410 and 420 is an orthogonal (unitary) matrix whose determinant can be -1 or +1, i.e. a reflection or a rotation matrix. If the transformation matrix is a reflection matrix (determinant equal to -1), it can be turned into a rotation matrix by negating one eigenvector (for example the eigenvector associated with the lowest eigenvalue) or by swapping two columns (eigenvectors).
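A minimal sketch of the reflection-to-rotation correction, assuming the matrix is a list of rows whose columns are eigenvectors and that the eigenvalues are supplied in the same column order. The helper names `det` and `force_rotation` are hypothetical, and only the "negate one eigenvector" variant is shown.

```python
def det(m):
    """Determinant by Laplace expansion; adequate for 2x2 to 4x4 matrices."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def force_rotation(v, eigenvalues):
    """Return a copy of v with determinant +1.

    If v is a reflection (negative determinant), the column associated
    with the smallest eigenvalue is negated, which flips the determinant.
    """
    if det(v) >= 0:
        return [row[:] for row in v]
    j = min(range(len(eigenvalues)), key=lambda i: eigenvalues[i])
    return [[-x if col == j else x for col, x in enumerate(row)] for row in v]


if __name__ == "__main__":
    reflection = [[0, 1], [1, 0]]          # determinant -1
    rotation = force_rotation(reflection, [2.0, 0.5])
    print(rotation, det(rotation))
```

Negating the vector tied to the smallest eigenvalue changes the channel that carries the least energy, which is presumably why the text suggests it.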
- Block 430 converts the rotation matrix into parameters.
- An angular representation is used for the quantization (6 generalized Euler angles for the 4D case, 3 Euler angles for the 3D case, and one rotation angle for the 2D case).
- In the ambisonic case with four channels, six generalized Euler angles are obtained according to the method described in the article “Generalization of Euler Angles to N-Dimensional Orthogonal Matrices” by David K. Hoffman, Richard C.
- the quantization indices of the transformation matrix are sent to the multiplexer (block 350).
- block 440 will be able to convert the quantized parameters into a quantized rotation matrix.
- the parameters used for quantization do not match the parameters used for interpolation.
- Blocks 430 and 440 can be replaced as follows:
- The unit quaternions q1, q2 (4D case) and q (3D case) can be converted into axis-angle representations known in the state of the art.
- Block 440 can perform quantization in the indicated domain:
- the encoder interpolates in block 460 the (quantized) representation of the rotation between the current frame and the previous frame to avoid excessively rapid fluctuations of the different channels after transformation.
- the number of interpolations can be fixed (equal to a predetermined value) or adaptive.
- Each frame is then divided into sub-frames as a function of the number of interpolations determined in the block 450.
- the block 450 can code on a chosen number of bits the number of interpolations to be performed, and therefore the number of subframes to be provided, in the case where this number is determined adaptively; in the case of a fixed interpolation, no information is to be coded.
- block 460 converts the rotation matrices to a specific domain representing a rotation matrix. The frame is divided into sub-frames, and in the chosen domain the interpolation is performed for each sub-frame.
- For a first-order ambisonic input signal (with 4 channels W, X, Y, Z), in block 460 the encoder reconstructs from the 6 quantized Euler angles a quantized 4D rotation matrix, which is then converted to two unit quaternions for interpolation purposes.
- When the encoder input is a planar ambisonic signal (3 channels W, X, Y), in block 460 the encoder reconstructs from the 3 quantized Euler angles a quantized 3D rotation matrix, which is then converted to a unit quaternion for interpolation purposes.
- When the encoder input is a stereo signal, the encoder uses in block 460 the representation of the quantized 2D rotation as a rotation angle.
- The rotation matrix calculated for frame $t$ is factored into 2 quaternions (a double quaternion) using the Cayley factorization, and the double quaternion stored for the previous frame $t-1$, denoted $(Q_{L,t-1}, Q_{R,t-1})$, is used.
- The quaternions are interpolated pairwise in each sub-frame.
- The block determines the shorter of the two possible paths (towards $Q_{L,t}$ or $-Q_{L,t}$).
- Where the negated quaternion gives the shorter path, the sign of the quaternion of the current frame is reversed.
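The shortest-path choice and the per-sub-frame interpolation can be illustrated as below. The text does not name the interpolation formula, so spherical linear interpolation (slerp) is an assumption here, as are the function names and the convention that sub-frame i of K uses the factor (i + 1) / K. Quaternions are 4-element lists (w, x, y, z).

```python
import math


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def slerp(q_prev, q_cur, alpha):
    """Interpolate between two unit quaternions along the shorter arc."""
    d = dot(q_prev, q_cur)
    if d < 0.0:
        # q and -q encode the same rotation: flip to take the shorter path.
        q_cur = [-c for c in q_cur]
        d = -d
    theta = math.acos(min(d, 1.0))
    if theta < 1e-6:
        # Nearly identical quaternions: a linear blend is accurate enough.
        w0, w1 = 1.0 - alpha, alpha
    else:
        w0 = math.sin((1.0 - alpha) * theta) / math.sin(theta)
        w1 = math.sin(alpha * theta) / math.sin(theta)
    out = [w0 * a + w1 * b for a, b in zip(q_prev, q_cur)]
    norm = math.sqrt(dot(out, out))  # guard against rounding drift
    return [x / norm for x in out]


def interpolate_subframes(q_prev, q_cur, k):
    """One interpolated quaternion per sub-frame, ending on frame t."""
    return [slerp(q_prev, q_cur, (i + 1) / k) for i in range(k)]


if __name__ == "__main__":
    q_prev = [1.0, 0.0, 0.0, 0.0]
    # 90 degree rotation about z, stored with the "long way round" sign.
    q_cur = [-math.cos(math.pi / 4), 0.0, 0.0, -math.sin(math.pi / 4)]
    for q in interpolate_subframes(q_prev, q_cur, 2):
        print(q)
```

In the 4D case the same routine would be applied separately to the left and right quaternions of the pair.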
- the rotation matrix of 4x4 dimension is calculated (respectively 3x3 for planar ambisonics or 2x2 for the stereo case).
- the quaternion and antiquaternion matrices are calculated and the matrix product is calculated.
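As a sketch of this step, a 4×4 rotation matrix can be rebuilt from a double quaternion as the product of a left-multiplication matrix and a right-multiplication matrix. Reading "quaternion matrix" as the left-multiplication matrix and "antiquaternion matrix" as the right-multiplication matrix is an assumption about the convention, and all function names are hypothetical.

```python
import math

BASIS = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def qmul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return [a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2]


def _from_columns(cols):
    return [[cols[j][i] for j in range(4)] for i in range(4)]


def quaternion_matrix(q):
    """Matrix of the map x -> q * x (left multiplication)."""
    return _from_columns([qmul(q, e) for e in BASIS])


def antiquaternion_matrix(q):
    """Matrix of the map x -> x * q (right multiplication)."""
    return _from_columns([qmul(e, q) for e in BASIS])


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rotation_4d(q_left, q_right):
    """4D rotation x -> q_left * x * q_right as a 4x4 matrix."""
    return matmul(quaternion_matrix(q_left), antiquaternion_matrix(q_right))


if __name__ == "__main__":
    ql = [math.cos(0.3), math.sin(0.3), 0.0, 0.0]
    qr = [math.cos(1.1), 0.0, math.sin(1.1), 0.0]
    for row in rotation_4d(ql, qr):
        print(row)
```

For unit quaternions the resulting matrix is orthogonal with determinant +1, so it can be applied directly to the four channels of a sub-frame.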
- The rotation matrices resulting from the interpolation are then used in the transformation block 470, which produces transformed channels by applying these rotation matrices to the ambisonic channels preprocessed by block 300.
- K denotes the number of sub-frames, determined in block 450 when this number is adaptive. It is derived either from the final difference between the current frame and the previous frame or directly from the angular difference of the parameters describing the rotation matrix. In the latter case, the aim is to keep the angular variation between successive sub-frames imperceptible.
- the realization of an adaptive number of subframes is especially advantageous for reducing the average complexity of the codec, but if it is chosen to reduce the complexity, it may be preferable to use an interpolation with a fixed number of subframes.
- The final difference between the corrected rotation matrix of frame $t$ and the rotation matrix of frame $t-1$ gives a measure of how much the channel matrixing differs between the two frames.
- the larger this gap the greater the number of subframes for the interpolation made in block 460.
- $I_n$ is the identity matrix,
- $V_t$ denotes the eigenvectors of the frame of index $t$,
- $\|M\|$ is a norm of the matrix $M$, which corresponds here to the sum of the absolute values of all the coefficients.
- Other matrix norms can be used (for example the Frobenius norm). If the two matrices are identical, this difference is equal to 0. The more dissimilar the matrices, the greater the value of the difference $d_t$.
- Predetermined thresholds can be applied to $d_t$, each threshold being associated with a predefined number of interpolations, for example according to the following decision logic:
- Thresholds: {4.0, 5.0, 6.0, 7.0}
- Number K of subframes for interpolation: {10, 48, 96, 192}
- the number K of interpolations determined by the block 450 is then sent to the interpolation module 460 and in the adaptive case the number of subframes is encoded in the form of a binary index which is sent to the multiplexer (block 350) .
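The adaptive choice of K might look like the sketch below. Two points are assumptions rather than statements from the text: the difference is computed as the sum of absolute coefficients of the identity minus the inter-correlation of the two eigenvector matrices, and each threshold is read as the lower bound of its K value, with the smallest K used below the first threshold. The function names are hypothetical.

```python
from bisect import bisect_right

THRESHOLDS = [4.0, 5.0, 6.0, 7.0]
SUBFRAME_COUNTS = [10, 48, 96, 192]


def frame_difference(v_prev, v_cur):
    """Sum of absolute coefficients of (I_n - V_t^T V_{t-1}).

    Assumed form of the difference; matrices are lists of rows with
    eigenvectors stored as columns.
    """
    n = len(v_cur)
    total = 0.0
    for i in range(n):
        for j in range(n):
            corr = sum(v_cur[k][i] * v_prev[k][j] for k in range(n))
            total += abs((1.0 if i == j else 0.0) - corr)
    return total


def choose_subframe_count(d_t):
    """Map the difference d_t to a number K of interpolation sub-frames.

    Assumed reading: pick the K paired with the highest threshold that
    d_t reaches; below the first threshold, fall back to the smallest K.
    """
    idx = bisect_right(THRESHOLDS, d_t) - 1
    return SUBFRAME_COUNTS[max(idx, 0)]


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    d = frame_difference(identity, identity)
    print(d, choose_subframe_count(d))
    print(choose_subframe_count(6.4))
```

Whatever the exact mapping, the encoder and decoder must share the same table, since only the index of K is transmitted.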
- Performing the interpolation ultimately makes it possible to optimize the decorrelation of the input channels before multi-mono coding.
- the rotation matrices calculated respectively for a previous frame t-1 and a current frame t may be very different due to this search for decorrelation, but the interpolation nevertheless makes it possible to smooth this difference.
- FIG. 5 describes a decoder in an exemplary embodiment of the invention.
- The allocation information is decoded (block 510), which makes it possible to demultiplex and decode (block 520) the bitstream(s) received for each of the transformed channels.
- Block 520 invokes multiple, separately executed instances of the core decoding.
- The core decoding can be of the EVS type, optionally modified to improve its performance.
- Each channel is decoded separately. If the encoding previously used is stereo or multi-channel encoding, the multi-mono approach can be replaced by multi-stereo or multi-channel decoding.
- the channels thus decoded are sent to block 530 which decodes the rotation matrix for the current frame and optionally the number K of subframes to be used for the interpolation (if the interpolation is adaptive).
- The interpolation block 460 splits the frame into sub-frames, the number K of which can be read from the encoded stream by block 610 (FIG. 6), and interpolates the rotation matrices, the goal being to recover, in the absence of transmission errors, the same matrices as in block 460 of the encoder, so that the transformation previously performed in block 470 can be reversed.
- Block 530 performs the matrixing that inverts that of block 470 to reconstruct a decoded signal, as detailed below with reference to FIG. 6. This matrixing amounts to calculating by sub-frame corresponds to the successive sub-blocks of size nx
- Block 530 globally performs the decoding and the PCA/KLT synthesis that reverses the analysis performed by block 310 of FIG. 3.
- the quantization indices of the rotation quantization parameters in the current frame are decoded in block 600. Scalar quantization can be used and the quantization step is identical for each angle.
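Uniform scalar quantization with a shared step can be sketched as follows. The step value used in the example and the function names are assumptions; the mapping from a bit budget to a step size, and any clamping of indices to a finite range, are left out.

```python
import math


def quantize_angles(angles, step):
    """Round each angle to the nearest multiple of step; return indices."""
    return [math.floor(a / step + 0.5) for a in angles]


def dequantize_angles(indices, step):
    """Rebuild the quantized angles from their integer indices."""
    return [i * step for i in indices]


if __name__ == "__main__":
    step = math.pi / 128          # hypothetical step, same for every angle
    angles = [0.3, -1.2, 2.0]
    indices = quantize_angles(angles, step)
    decoded = dequantize_angles(indices, step)
    print(indices)
    print([abs(a - b) for a, b in zip(angles, decoded)])
```

With this rounding rule the reconstruction error of each angle is at most half a step.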
- The number of interpolation sub-frames is decoded (block 610) to find the number K of sub-frames among the set {10, 48, 96, 192}; in variants where the frame length L is different, this set of values may be adapted.
- the interpolation of the decoder is identical to that performed at the encoder (block 460).
- Block 620 performs the inverse matrixing of the ambisonic channels per sub-frame, using the inverses (in practice, the transposes) of the transformation matrices calculated in block 460.
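The per-sub-frame matrixing and its inversion by transposition can be illustrated on the simplest case, the stereo (2D) case with a single rotation angle. This is a sketch under assumptions: the angle is interpolated linearly without wrap-around handling, the frame length is a multiple of K, and which side applies the matrix versus its transpose is a convention chosen here for illustration. All names are hypothetical.

```python
import math


def rotation_2d(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def apply(m, samples):
    """Apply a 2x2 matrix to a list of [ch0, ch1] sample pairs."""
    return [[m[0][0] * x + m[0][1] * y, m[1][0] * x + m[1][1] * y]
            for x, y in samples]


def subframe_angles(theta_prev, theta_cur, k):
    """Linearly interpolated angle for each of the k sub-frames."""
    return [theta_prev + (theta_cur - theta_prev) * (i + 1) / k
            for i in range(k)]


def transform_frame(frame, theta_prev, theta_cur, k, inverse=False):
    """Rotate each sub-frame with its own interpolated matrix.

    With inverse=True the transpose (the inverse of a rotation) is used.
    """
    size = len(frame) // k
    out = []
    for i, theta in enumerate(subframe_angles(theta_prev, theta_cur, k)):
        m = rotation_2d(theta)
        if inverse:
            m = transpose(m)
        out.extend(apply(m, frame[i * size:(i + 1) * size]))
    return out


if __name__ == "__main__":
    frame = [[1.0, 0.5], [0.2, -0.3], [0.0, 1.0], [-1.0, 0.4]]
    coded = transform_frame(frame, 0.1, 0.9, 2)
    decoded = transform_frame(coded, 0.1, 0.9, 2, inverse=True)
    print(decoded)
```

Because both sides derive the same interpolated matrices from the same quantized parameters, the round trip is exact up to floating-point rounding when the channels themselves are not degraded by coding.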
- The invention uses a completely different approach from the overlap-add of the MPEG-H codec: it is based on a specific representation of the transformation matrices, which are restricted to rotation matrices from one frame to the next, in the time domain, allowing in particular an interpolation of the transformation matrices, with a matching that ensures consistency in direction (including the sense, through the sign).
- The general approach of the invention is a coding of ambisonic sounds in the time domain by PCA, with in particular PCA transformation matrices forced to be rotation matrices and interpolated by sub-frames in an optimized manner (in particular in the domain of quaternions / double quaternions) to improve the quality.
- the interpolation step is either fixed or adaptive as a function of a criterion of difference between an inter-correlation matrix and a reference matrix (identity) or between matrices to be interpolated.
- The quantization of the rotation matrices can be implemented in the domain of generalized Euler angles. However, it may be preferable to quantize the matrices of dimension 3 and 4 in the domain of quaternions and double quaternions (respectively), which makes it possible to remain in the same domain for the quantization and the interpolation.
- eigenvector alignment is used to avoid the problems of clicks and channel inversion from frame to frame.
- the present invention is not limited to the embodiments described above by way of example and extends to other variants.
- the foregoing description has dealt with the cases of four channels.
- The transformation matrices at frames $t-1$ and $t$ are denoted $V_{t-1}$ and $V_t$.
- The interpolation can be performed with a factor between $V_{t-1}$ and $V_t$ such that:
Priority Applications (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/436,390 US11922959B2 (en) | 2019-03-05 | 2020-02-10 | Spatialized audio coding with interpolation and quantization of rotations |
CN202080031569.8A CN113728382A (zh) | 2019-03-05 | 2020-02-10 | 利用旋转的插值和量化进行空间化音频编解码 |
EP20703048.7A EP3935629A1 (fr) | 2019-03-05 | 2020-02-10 | Codage audio spatialisé avec interpolation et quantification de rotations |
BR112021017511A BR112021017511A2 (pt) | 2019-03-05 | 2020-02-10 | Codificação de áudio espacializada com interpolação e quantização de rotações |
JP2021552656A JP7419388B2 (ja) | 2019-03-05 | 2020-02-10 | 回転の補間と量子化による空間化オーディオコーディング |
KR1020217031995A KR20210137114A (ko) | 2019-03-05 | 2020-02-10 | 회전들의 보간 및 양자화를 통한 공간화된 오디오 코딩 |
ZA2021/06465A ZA202106465B (en) | 2019-03-05 | 2021-09-03 | Spatialized audio coding with interpolation and quantification of rotations |
JP2024001364A JP2024024095A (ja) | 2019-03-05 | 2024-01-09 | 回転の補間と量子化による空間化オーディオコーディング |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP19305254.5A EP3706119A1 (fr) | 2019-03-05 | 2019-03-05 | Codage audio spatialisé avec interpolation et quantification de rotations |
EP19305254.5 | 2019-03-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020177981A1 (fr) | 2020-09-10 |
Family
ID=65991736
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2020/053264 WO2020177981A1 (fr) | 2019-03-05 | 2020-02-10 | Codage audio spatialisé avec interpolation et quantification de rotations |
Country Status (8)
Country | Link |
---|---|
US (1) | US11922959B2 (fr) |
EP (2) | EP3706119A1 (fr) |
JP (2) | JP7419388B2 (fr) |
KR (1) | KR20210137114A (fr) |
CN (1) | CN113728382A (fr) |
BR (1) | BR112021017511A2 (fr) |
WO (1) | WO2020177981A1 (fr) |
ZA (1) | ZA202106465B (fr) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160155448A1 (en) * | 2013-07-05 | 2016-06-02 | Dolby International Ab | Enhanced sound field coding using parametric component generation |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101450940B1 (ko) | 2007-09-19 | 2014-10-15 | 텔레폰악티에볼라겟엘엠에릭슨(펍) | 멀티채널 오디오의 조인트 인핸스먼트 |
CN102656628B (zh) * | 2009-10-15 | 2014-08-13 | 法国电信公司 | 优化的低吞吐量参数编码/解码 |
US9854377B2 (en) * | 2013-05-29 | 2017-12-26 | Qualcomm Incorporated | Interpolation for decomposed representations of a sound field |
CN104282309A (zh) | 2013-07-05 | 2015-01-14 | 杜比实验室特许公司 | 丢包掩蔽装置和方法以及音频处理系统 |
Date | Application | Publication | Status |
---|---|---|---|
2019-03-05 | EP19305254.5A | EP3706119A1 (fr) | Not active, Withdrawn |
2020-02-10 | PCT/EP2020/053264 | WO2020177981A1 (fr) | Unknown |
2020-02-10 | BR112021017511A | BR112021017511A2 (pt) | Unknown |
2020-02-10 | JP2021552656A | JP7419388B2 (ja) | Active |
2020-02-10 | CN202080031569.8A | CN113728382A (zh) | Pending |
2020-02-10 | US17/436,390 | US11922959B2 (en) | Active |
2020-02-10 | KR1020217031995A | KR20210137114A (ko) | Unknown |
2020-02-10 | EP20703048.7A | EP3935629A1 (fr) | Pending |
2021-09-03 | ZA2021/06465A | ZA202106465B (en) | Unknown |
2024-01-09 | JP2024001364A | JP2024024095A (ja) | Pending |
Non-Patent Citations (1)
Title |
---|
ROUMEN KOUNTCHEV ET AL: "New method for adaptive karhunen-loeve color transform", TELECOMMUNICATION IN MODERN SATELLITE, CABLE, AND BROADCASTING SERVICES, 2009. TELSIKS '09. 9TH INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 7 October 2009 (2009-10-07), pages 209 - 216, XP031573422, ISBN: 978-1-4244-4382-6 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022120011A1 (fr) * | 2020-12-02 | 2022-06-09 | Dolby Laboratories Licensing Corporation | Rotation de composantes sonores pour schémas de codage dépendant de l'orientation |
FR3118266A1 (fr) | 2020-12-22 | 2022-06-24 | Orange | Codage optimisé de matrices de rotations pour le codage d’un signal audio multicanal |
WO2022136760A1 (fr) | 2020-12-22 | 2022-06-30 | Orange | Codage optimise de matrices de rotations pour le codage d'un signal audio multicanal |
WO2022262576A1 (fr) * | 2021-06-18 | 2022-12-22 | 华为技术有限公司 | Procédé et appareil de codage de signal audio tridimensionnel, codeur et système |
EP4120255A1 (fr) | 2021-07-15 | 2023-01-18 | Orange | Quantification vectorielle spherique optimisee |
WO2023285748A1 (fr) | 2021-07-15 | 2023-01-19 | Orange | Quantification vectorielle spherique optimisee |
FR3136099A1 (fr) | 2022-05-30 | 2023-12-01 | Orange | Codage audio spatialisé avec adaptation d’un traitement de décorrélation |
WO2023232823A1 (fr) | 2022-05-30 | 2023-12-07 | Orange | Titre: codage audio spatialisé avec adaptation d'un traitement de décorrélation |
Also Published As
Publication number | Publication date |
---|---|
JP2024024095A (ja) | 2024-02-21 |
US11922959B2 (en) | 2024-03-05 |
KR20210137114A (ko) | 2021-11-17 |
ZA202106465B (en) | 2022-07-27 |
JP2022523414A (ja) | 2022-04-22 |
JP7419388B2 (ja) | 2024-01-22 |
EP3706119A1 (fr) | 2020-09-09 |
EP3935629A1 (fr) | 2022-01-12 |
CN113728382A (zh) | 2021-11-30 |
BR112021017511A2 (pt) | 2021-11-16 |
US20220148607A1 (en) | 2022-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020177981A1 (fr) | Codage audio spatialisé avec interpolation et quantification de rotations | |
EP2374123B1 (fr) | Codage perfectionne de signaux audionumeriques multicanaux | |
EP1600042B1 (fr) | Procede de traitement de donnees sonores compressees, pour spatialisation | |
EP2002424B1 (fr) | Dispositif et procede de codage scalable d'un signal audio multi-canal selon une analyse en composante principale | |
EP2374124B1 (fr) | Codage perfectionne de signaux audionumériques multicanaux | |
EP3427260B1 (fr) | Codage et décodage optimisé d'informations de spatialisation pour le codage et le décodage paramétrique d'un signal audio multicanal | |
EP2143102B1 (fr) | Procede de codage et decodage audio, codeur audio, decodeur audio et programmes d'ordinateur associes | |
EP2168121B1 (fr) | Quantification apres transformation lineaire combinant les signaux audio d'une scene sonore, codeur associe | |
EP2005420A1 (fr) | Dispositif et procede de codage par analyse en composante principale d'un signal audio multi-canal | |
WO2010004155A1 (fr) | Synthese spatiale de signaux audio multicanaux | |
FR3045915A1 (fr) | Traitement de reduction de canaux adaptatif pour le codage d'un signal audio multicanal | |
Mahé et al. | First-order ambisonic coding with quaternion-based interpolation of PCA rotation matrices | |
EP4042418B1 (fr) | Détermination de corrections à appliquer a un signal audio multicanal, codage et décodage associés | |
WO2023232823A1 (fr) | Titre: codage audio spatialisé avec adaptation d'un traitement de décorrélation | |
EP2198425A1 (fr) | Procede, module et programme d'ordinateur avec quantification en fonction des vecteurs de gerzon | |
EP4172986A1 (fr) | Codage optimise d'une information representative d'une image spatiale d'un signal audio multicanal | |
FR3118266A1 (fr) | Codage optimisé de matrices de rotations pour le codage d’un signal audio multicanal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20703048; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2021552656; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | REG | Reference to national code | Ref country code: BR; Ref legal event code: B01A; Ref document number: 112021017511; Country of ref document: BR |
 | ENP | Entry into the national phase | Ref document number: 20217031995; Country of ref document: KR; Kind code of ref document: A |
 | ENP | Entry into the national phase | Ref document number: 2020703048; Country of ref document: EP; Effective date: 20211005 |
 | ENP | Entry into the national phase | Ref document number: 112021017511; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20210902 |