EP4042418A1

EP4042418A1 - Determining corrections to be applied to a multichannel audio signal, associated coding and decoding

Info

Publication number: EP4042418A1
Application number: EP20792467.1A
Authority: EP
Inventors: Pierre Clément MAHE; Stéphane RAGOT; Jerome Daniel
Original assignee: Orange SA
Current assignee: Orange SA
Priority date: 2019-10-02
Filing date: 2020-09-24
Publication date: 2022-08-17
Anticipated expiration: 2040-09-24
Also published as: EP4042418B1; BR112022005783A2; KR20220076480A; ES2965084T3; WO2021064311A1; CN114503195A; JP2022550803A; ZA202203157B; US20220358937A1; FR3101741A1

Abstract

The invention relates to a method for determining a set of corrections (Corr.) to be made to a multichannel sound signal, in which the set of corrections is determined on the basis of an item of information representative of a spatial image of an original multichannel signal (Inf. B) and an item of information representative of a spatial image of the original multichannel signal that has been coded and then decoded (Inf. B̂). The invention also relates to a decoding method and a coding method implementing the determining method, and to the associated coding and decoding devices.

Description

Title: Determination of corrections to be applied to a multichannel audio signal, associated encoding and decoding

The present invention relates to the encoding/decoding of spatialized sound data, in particular in a surround sound context (hereinafter also referred to as “ambisonic”).

The encoders/decoders (hereinafter called “codecs”) currently used in mobile telephony are mono (a single signal channel for reproduction on a single loudspeaker). The 3GPP EVS (“Enhanced Voice Services”) codec offers “Super-HD” quality (also called “High Definition Plus” or HD+ voice) with a super-wideband (SWB) audio band for signals sampled at 32 or 48 kHz, or a full band (FB, “fullband”) for signals sampled at 48 kHz; the audio bandwidth is 14.4 to 16 kHz in SWB mode (9.6 to 128 kbit/s) and 20 kHz in FB mode (16.4 to 128 kbit/s).

The next quality evolution in conversational services offered by operators should be immersive services, using terminals such as smartphones equipped with several microphones, spatialized audio conferencing or videoconferencing equipment of the telepresence or 360° video type, or even “live” audio content sharing equipment, with spatialized 3D sound rendering that is far more immersive than a simple 2D stereo reproduction. With the increasingly widespread use of listening on mobile phones with an audio headset and the appearance of advanced audio equipment (accessories such as a 3D microphone, voice assistants with acoustic antennas, virtual reality headsets, etc.), the capture and rendering of spatialized sound scenes are now common enough to offer an immersive communication experience.

As such, the future 3GPP standard “IVAS” (for “Immersive Voice And Audio Services”) proposes extending the EVS codec to immersive audio by accepting as codec input formats at least the spatialized sound formats listed below (and their combinations):

- Multichannel (channel-based) format of stereo or 5.1 type, where each channel feeds a loudspeaker (for example L and R in stereo or L, R, Ls, Rs and C in 5.1);

- Object-based format, where sound objects are described as an audio signal (generally mono) associated with metadata describing the attributes of this object (position in space, spatial width of the source, etc.);

- Ambisonic (scene-based) format, which describes the sound field at a given point, generally picked up by a spherical microphone or synthesized in the spherical harmonics domain.

Hereinafter, we are typically interested in the coding of sound in ambisonic format, by way of exemplary embodiment (at least certain aspects presented in connection with the invention below can also be applied to formats other than ambisonics).

Ambisonics is a recording method (“encoding” in the acoustic sense) of spatialized sound and a reproduction system (“decoding” in the acoustic sense). An ambisonic microphone (at order 1) comprises at least four capsules (typically of the cardioid or sub-cardioid type) arranged on a spherical grid, for example the vertices of a regular tetrahedron. The audio channels associated with these capsules are called “A-format”. This format is converted into a “B-format”, in which the sound field is broken down into four components (spherical harmonics) denoted W, X, Y, Z, which correspond to four coincident virtual microphones. The W component corresponds to an omnidirectional capture of the sound field while the X, Y and Z components, which are more directive, can be compared to pressure-gradient microphones oriented along the three orthogonal axes of space. An ambisonic system is flexible in the sense that recording and playback are separate and decoupled. It allows decoding (in the acoustic sense) on any loudspeaker configuration (for example, binaural, 5.1-type “surround” sound or 7.1.4-type sound with elevation). The ambisonic approach can be generalized to more than four channels in B-format, and this generalized representation is commonly referred to as “HOA” (for “Higher-Order Ambisonics”). Breaking the sound down into more spherical harmonics improves the spatial precision of reproduction when rendering on loudspeakers. An ambisonic signal at order M comprises K = (M + 1)² components and, at order 1 (M = 1), we find the four components W, X, Y and Z, commonly called FOA (for “First-Order Ambisonics”). There is also a so-called “planar” variant of ambisonics (W, X, Y), which decomposes the sound defined in a plane, generally the horizontal plane. In this case, the number of components is K = 2M + 1 channels.
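The two component counts given above, K = (M + 1)² for full-sphere ambisonics and K = 2M + 1 for the planar variant, can be illustrated with a minimal sketch. The function name `ambisonic_channel_count` is hypothetical and is introduced here only for illustration.

```python
def ambisonic_channel_count(order: int, planar: bool = False) -> int:
    """Number of ambisonic components K for a given order M.

    Full-sphere: K = (M + 1)^2; planar variant: K = 2M + 1.
    """
    if order < 0:
        raise ValueError("ambisonic order must be non-negative")
    return 2 * order + 1 if planar else (order + 1) ** 2


# First-order ambisonics (FOA) has the four components W, X, Y, Z.
print(ambisonic_channel_count(1))               # full-sphere, order 1
print(ambisonic_channel_count(1, planar=True))  # planar, order 1 (W, X, Y)
print(ambisonic_channel_count(2))               # full-sphere, order 2
```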

First-order ambisonics (4 channels: W, X, Y, Z), first-order planar ambisonics (3 channels: W, X, Y) and higher-order ambisonics are all referred to hereinafter as “ambisonics” indiscriminately to facilitate reading, the processing presented being applicable regardless of whether the representation is planar and regardless of the number of ambisonic components.

Hereinafter, an “ambisonic signal” will designate a signal in B-format with a predetermined order and a certain number of ambisonic components. This also includes hybrid cases, where for example at order 2 there are only 8 channels (instead of 9) - more precisely, at order 2, we find the 4 channels of order 1 (W, X, Y, Z) to which we normally add 5 channels (usually denoted R, S, T, U, V), and we can for example ignore one of the higher-order channels (for example R).

The signals to be processed by the encoder / decoder are in the form of successions of blocks of sound samples called "frames" or "sub-frames" below.

In addition, hereafter, the mathematical notations follow the following convention:

- Scalar: s or N (lowercase for variables or uppercase for constants)

- the operator Re (.) designates the real part of a complex number

- Vector: u (lowercase, bold)

- Matrix: A (uppercase, bold)

The notations A^T and A^H indicate respectively the transposition and the Hermitian transposition (transposed and conjugated) of A.

- A one-dimensional discrete-time signal, s(i), defined over a time interval i = 0, ..., L-1 of length L is represented by a row vector s = [s(0), ..., s(L-1)]

We can also write: s = [s_0, ..., s_{L-1}] to avoid the use of parentheses.

- A multidimensional discrete-time signal, b(i), defined over a time interval i = 0, ..., L-1 of length L and with K dimensions is represented by a matrix of size K x L.

We can also note: B = [B_ij], i = 0, ..., K-1, j = 0, ..., L-1, to avoid the use of parentheses.

- A 3D point of Cartesian coordinates (x, y, z) can be converted to spherical coordinates (r, Θ, φ), where r is the distance to the origin, Θ is the azimuth and φ the elevation. We use here, without loss of generality, the mathematical convention where the elevation is defined with respect to the horizontal plane (Oxy); the invention can be easily adapted to other definitions, including the convention used in physics where the azimuth is defined with respect to the Oz axis.
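The coordinate convention above (elevation measured from the horizontal plane Oxy) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and angles are assumed to be in radians.

```python
import math


def cartesian_to_spherical(x: float, y: float, z: float):
    """Return (r, azimuth, elevation) with elevation measured from the Oxy plane."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)                    # angle in the horizontal plane
    elevation = math.atan2(z, math.hypot(x, y))   # angle above the horizontal plane
    return r, azimuth, elevation


def spherical_to_cartesian(r: float, azimuth: float, elevation: float):
    """Inverse conversion under the same convention."""
    x = r * math.cos(elevation) * math.cos(azimuth)
    y = r * math.cos(elevation) * math.sin(azimuth)
    z = r * math.sin(elevation)
    return x, y, z
```

With this convention a point on the Oz axis has an elevation of π/2, whereas a convention based on the polar angle from Oz would give 0 for the same point.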

In addition, we do not recall here the conventions known in the state of the art in ambisonics concerning the order of the ambisonic components (including ACN for Ambisonic Channel Number, SID for Single Index Designation, FUMA for Furse-Malham) and the normalization of ambisonic components (SN3D, N3D, maxN). More details can be found for example in the resource available online: https://en.wikipedia.org/wiki/Ambisonic_data_exchange_formats

By convention, the first component of an ambisonic signal generally corresponds to the omnidirectional component W.

The simplest approach to encoding an ambisonic signal is to use a mono encoder and apply it in parallel to all channels, possibly with a different bit allocation depending on the channel. This approach is referred to herein as “multi-mono”. The multi-mono approach can be extended to multi-stereo coding (where pairs of channels are coded separately by a stereo codec) or more generally to the use of several parallel instances of the same core codec.

Such an embodiment is shown in FIG. 1. The input signal is divided into channels (a mono channel or several channels) by the block 100. These channels are coded separately by the blocks 120 to 122 according to a predetermined distribution and bit allocation. Their bit streams are multiplexed (block 130) and, after transmission and/or storage, demultiplexed (block 140) so that decoding can be applied to reconstruct the decoded channels (blocks 150 to 152), which are then recombined (block 160).

The associated quality varies depending on the core encoding and decoding used (blocks 120 to 122 and 150 to 152), and it is generally only satisfactory at very high bit rates. For example, in the multi-mono case, EVS encoding can be considered quasi-transparent (perceptually) at a rate of at least 48 kbit/s per (mono) channel; thus for an ambisonic signal of order 1 we obtain a minimum bit rate of 4 x 48 = 192 kbit/s.
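The bit-rate arithmetic above generalizes directly to higher orders, which shows how quickly the multi-mono approach becomes expensive. The following is a minimal sketch; the function name is hypothetical, and the 48 kbit/s default simply reuses the per-channel figure quoted above.

```python
def multi_mono_bitrate_kbps(order: int, per_channel_kbps: float = 48.0) -> float:
    """Total bit rate when each of the (M + 1)^2 ambisonic channels is coded separately."""
    if order < 0:
        raise ValueError("ambisonic order must be non-negative")
    num_channels = (order + 1) ** 2
    return num_channels * per_channel_kbps


for m in (1, 2, 3):
    print(f"order {m}: {multi_mono_bitrate_kbps(m):.0f} kbit/s")
```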

The multi-mono coding approach does not take into account the correlation between channels; it produces spatial distortions together with various artefacts such as the appearance of phantom sound sources, diffuse noise or shifts in the trajectories of sound sources. Thus, the encoding of an ambisonic signal according to this approach degrades spatialization.

An alternative approach to coding all channels separately is given, for a stereo or multichannel signal, by parametric coding. For this type of encoding, the input multichannel signal is reduced to a smaller number of channels by a processing called “downmix”; these channels are encoded and transmitted, and additional spatialization information is also encoded. Parametric decoding consists in increasing the number of channels after decoding of the transmitted channels, by using a processing called “upmix” (typically implemented by decorrelation) and a spatial synthesis as a function of the additional decoded spatialization information. An example of stereo parametric coding is given by the 3GPP e-AAC+ codec. It should be noted that the downmix operation also degrades spatialization; in this case, the spatial image is changed.

The invention improves the state of the art.

For this purpose, it proposes a method for determining a set of corrections to be made to a multichannel sound signal, in which the set of corrections is determined from information representative of a spatial image of an original multichannel signal and information representative of a spatial image of the original multichannel signal after encoding and decoding.

Thus, the determined set of corrections, to be applied to the decoded multichannel signal, makes it possible to limit the spatial degradations due to the coding and possibly to channel reduction / increase operations. The implementation of the correction thus makes it possible to find a spatial image of the decoded multichannel signal closest to the spatial image of the original multichannel signal.

In a particular embodiment, the determination of the set of corrections is performed in the full band time domain (a frequency band). In variants, it is performed in the time domain by frequency sub-band. This makes it possible to adapt the corrections according to the frequency bands. In other variants, it is performed in a real or complex transformed domain (typically frequency) of the short-term discrete Fourier transform (STFT), modified discrete cosine transform (MDCT), or other type.

The invention also relates to a method for decoding a multichannel sound signal, comprising the following steps:

- reception of a binary stream comprising an encoded audio signal from an original multichannel signal and information representative of a spatial image of the original multichannel signal;

- decoding of the received encoded audio signal and obtaining a decoded multichannel signal;

- decoding of information representative of a spatial image of the original multichannel signal;

- determination of information representative of a spatial image of the decoded multichannel signal;

- determination of a set of corrections to be made to the decoded signal according to the determination method described above;

- correction of the decoded multichannel signal by the determined set of corrections.

Thus, in this embodiment, the decoder is able to determine the corrections to be made to the decoded multichannel signal, from information representative of the spatial image of the original multichannel signal, received from the encoder. The information received from the encoder is thus limited. It is the decoder that takes care of both determining and applying the corrections.

The invention also relates to a method for encoding a multichannel sound signal, comprising the following steps:

- encoding of an audio signal from an original multichannel signal;

- determination of information representative of a spatial image of the original multichannel signal;

- local decoding of the encoded audio signal and obtaining a decoded multichannel signal;

- determination of information representative of a spatial image of the decoded multichannel signal;

- determination of a set of corrections to be made to the decoded multichannel signal according to the determination method described above;

- coding of the determined set of corrections.

In this embodiment, it is the encoder which determines the set of corrections to be made to the decoded multichannel signal and which transmits it to the decoder. It is therefore the encoder that initiates this determination of corrections.

In a first particular embodiment of the decoding method as described above or of the encoding method as described above, the information representative of a spatial image is a covariance matrix and the determination of the set of corrections further comprises the following steps:

- obtaining a weighting matrix comprising weighting vectors associated with a set of virtual loudspeakers;

- determination of a spatial image of the original multichannel signal from the weighting matrix obtained and from the covariance matrix of the original multichannel signal received;

- determination of a spatial image of the decoded multichannel signal from the weighting matrix obtained and from the covariance matrix of the determined decoded multichannel signal;

- calculating a ratio between the spatial image of the original multichannel signal and the spatial image of the decoded multichannel signal at the speaker directions of the virtual speaker set, to obtain a set of gains.
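The four steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the spatial image at each virtual loudspeaker is taken here as the quadratic form d_n C d_n^T (a steered-response-power style energy), and the gain is taken as the square root of the energy ratio so that it applies to signal amplitudes. Neither choice is specified in the passage above, and the names `spatial_image`, `correction_gains` and `eps` are hypothetical.

```python
import math


def spatial_image(D, C):
    """Energy per virtual loudspeaker: sigma_n = d_n C d_n^T.

    D is a list of weighting vectors (one per virtual loudspeaker, length K),
    C is a K x K covariance matrix given as a list of rows.
    """
    K = len(C)
    return [
        sum(d[i] * C[i][j] * d[j] for i in range(K) for j in range(K))
        for d in D
    ]


def correction_gains(D, C_orig, C_dec, eps=1e-12):
    """Ratio of original to decoded spatial image, one gain per direction.

    Assumption: the square root turns an energy ratio into an amplitude gain;
    eps (hypothetical) guards against division by zero.
    """
    img_orig = spatial_image(D, C_orig)
    img_dec = spatial_image(D, C_dec)
    return [
        math.sqrt(max(o, 0.0) / max(d, eps))
        for o, d in zip(img_orig, img_dec)
    ]
```

For example, if the decoded signal has lost three quarters of its energy in one direction, the gain for that direction is 2 and the gains for unaffected directions are 1.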

According to this embodiment, the method using rendering on loudspeakers makes it possible to transmit only a limited quantity of data from the encoder to the decoder. Indeed, for a given order M, K = (M + 1)² coefficients to be transmitted (associated with as many virtual loudspeakers) may be sufficient, but for a more stable correction it may be advisable to use more virtual loudspeakers and therefore to transmit more points. In addition, the correction can be easily interpreted in terms of the gains associated with virtual loudspeakers.

In another variant embodiment, in the case where the encoder directly determines the energy of the signal in different directions and transmits this spatial image of the original multichannel signal to the decoder, the determination of the set of corrections of the decoding method further comprises the following steps:

- obtaining a weighting matrix comprising weighting vectors associated with a set of virtual loudspeakers;

- determination of a spatial image of the decoded multichannel signal from the weighting matrix obtained and from the information representative of a spatial image of the determined decoded multichannel signal;

In order to guarantee a correction value which is not too abrupt, the decoding method or the encoding method comprises a step of limiting the values of gains obtained according to at least one threshold.
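The limiting step can be sketched as a simple clamp. The text only requires at least one threshold; the lower and upper bounds used below (0.5 and 2.0) are hypothetical values chosen for illustration, as is the function name.

```python
def limit_gains(gains, g_min=0.5, g_max=2.0):
    """Clamp each gain to [g_min, g_max] to avoid abrupt corrections.

    g_min and g_max are illustrative thresholds, not values from the source.
    """
    if g_min > g_max:
        raise ValueError("g_min must not exceed g_max")
    return [min(max(g, g_min), g_max) for g in gains]


print(limit_gains([0.1, 1.0, 5.0]))
```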

The set of gains thus obtained constitutes the set of corrections and may for example take the form of a correction matrix comprising all of the gains thus determined.

In a second particular embodiment of the decoding method or of the encoding method, the information representative of a spatial image is a covariance matrix and the determination of the set of corrections comprises a step of determining a transformation matrix by matrix decomposition of the two covariance matrices, the transformation matrix constituting the set of corrections.

This embodiment has the advantage of making the corrections directly in the ambisonic domain in the case of an ambisonic multichannel signal. The steps of transforming the signals reproduced on loudspeakers into the ambisonic domain are thus avoided. This embodiment also makes the correction mathematically optimal, although it requires the transmission of a greater number of coefficients than the method with rendering on loudspeakers. Indeed, for an order M and consequently a number of components K = (M + 1)², the number of coefficients to be transmitted is K x (K + 1) / 2. In order to avoid amplifying too much in certain frequency zones, a normalization factor is determined and applied to the transformation matrix.

In the case where the set of corrections is represented by a transformation matrix or a correction matrix as described above, the correction of the decoded multichannel signal by the determined set of corrections is performed by applying the set of corrections to the decoded multichannel signal, that is to say directly in the ambisonic domain in the case of an ambisonic signal.
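As an illustration of the transformation-matrix embodiment, the sketch below uses a Cholesky factorization as one possible matrix decomposition; the text does not state which decomposition is used, so this choice is an assumption. With C_orig = L_o L_o^T and C_dec = L_d L_d^T, the matrix T = L_o L_d^{-1} satisfies T C_dec T^T = C_orig. All function names are hypothetical, both covariance matrices are assumed to be symmetric positive definite, and the normalization factor mentioned above is not modeled.

```python
import math


def cholesky(C):
    """Lower-triangular L such that C = L L^T (C symmetric positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L


def invert_lower(L):
    """Inverse of a lower-triangular matrix by forward substitution."""
    n = len(L)
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        inv[i][i] = 1.0 / L[i][i]
        for j in range(i):
            s = sum(L[i][k] * inv[k][j] for k in range(j, i))
            inv[i][j] = -s / L[i][i]
    return inv


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transformation_matrix(C_orig, C_dec):
    """T = L_orig L_dec^{-1}, so that T C_dec T^T equals C_orig."""
    return matmul(cholesky(C_orig), invert_lower(cholesky(C_dec)))
```

With this particular decomposition T is lower triangular and therefore has K(K + 1)/2 potentially non-zero coefficients, which matches the coefficient count quoted above.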

In the speaker rendering embodiment implemented by the decoder, the correction of the multichannel signal decoded by the determined set of corrections is performed according to the following steps:

- acoustic decoding of the decoded multichannel signal on the defined set of virtual speakers;

- application of the set of gains obtained to the signals resulting from the acoustic decoding;

- acoustic coding of the signals resulting from the acoustic decoding and corrected to obtain components of the multichannel signal;

- summation of the components of the multichannel signal thus obtained to obtain a corrected multichannel signal.

In an alternative embodiment, the steps of decoding, applying gains and encoding / summing above are grouped together in a direct correction operation by a correction matrix. This correction matrix can be applied directly to the decoded multichannel signal, which has the advantage as described above of making the corrections directly in the ambisonic domain.
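The equivalence between the step-by-step correction and the single correction matrix can be sketched as follows. Here D stands for an acoustic decoding matrix (N virtual loudspeakers by K components) and E for an acoustic encoding matrix (K by N); these symbols and the function names are assumptions made for illustration, and the way D and E are derived from the loudspeaker layout is not covered.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def correct_via_loudspeakers(B_dec, D, E, gains):
    """Decode to virtual loudspeakers, apply one gain per loudspeaker, re-encode.

    The summation over loudspeakers is carried out by the final matrix product.
    """
    speaker_signals = matmul(D, B_dec)                        # N x L
    corrected = [[g * x for x in row] for g, row in zip(gains, speaker_signals)]
    return matmul(E, corrected)                               # K x L


def correction_matrix(D, E, gains):
    """Single K x K matrix E diag(g) D grouping the three steps."""
    scaled_D = [[g * x for x in row] for g, row in zip(gains, D)]
    return matmul(E, scaled_D)
```

Applying `correction_matrix(D, E, gains)` to the decoded signal gives the same result as `correct_via_loudspeakers`, because matrix multiplication is associative; the matrix form only has to be computed once per set of gains.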

In a second embodiment, where the encoding method implements the method for determining all of the corrections, the decoding method comprises the following steps:

- reception of a binary stream comprising an encoded audio signal from an original multichannel signal and a coded set of corrections to be made to the decoded multichannel signal, the set of corrections having been coded according to a coding method described above;

- decoding of the received encoded audio signal and obtaining a decoded multichannel signal;

- decoding of the coded set of corrections;

- correction of the decoded multichannel signal by applying the decoded set of corrections to the decoded multichannel signal.

In this embodiment, it is the encoder which determines the corrections to be made to the decoded multichannel signal, directly in the ambisonic domain and it is the decoder which implements the application of these corrections to the decoded multichannel signal, directly in the ambisonic domain. The set of corrections can in this case be a transformation matrix or else a correction matrix comprising a set of gains.

In an alternative embodiment of the decoding method by rendering on loudspeakers, the decoding method comprises the following steps:

- reception of a binary stream comprising an encoded audio signal originating from an original multichannel signal and a coded set of corrections to be made to the decoded multichannel signal, the set of corrections having been coded according to a coding method as described previously;

- decoding of the coded set of corrections;

- correction of the decoded multichannel signal by the decoded set of corrections according to the following steps:

. acoustic decoding of the decoded multichannel signal on the defined set of virtual speakers;

. application of the set of gains obtained to the signals resulting from the acoustic decoding;

. acoustic coding of the signals resulting from the acoustic decoding and corrected to obtain components of the multichannel signal;

. summation of the components of the multichannel signal thus obtained to obtain a corrected multichannel signal.

In this embodiment, it is the encoder which determines the corrections to be made to the signals resulting from the acoustic decoding on a set of virtual loudspeakers, and it is the decoder which applies these corrections to the signals resulting from the acoustic decoding and then transforms these signals to return to the ambisonic domain in the case of an ambisonic multichannel signal.

In an alternative embodiment, the steps of decoding, applying gains and encoding / summing above are grouped together in a direct correction operation by a correction matrix. The correction is then carried out directly by applying a correction matrix to the decoded multichannel signal, for example the ambisonic signal. As described previously, this has the advantage of making corrections directly in the Ambisonic domain.

The invention also relates to a decoding device comprising a processing circuit for implementing the decoding methods as described above. The invention also relates to a coding device comprising a processing circuit for implementing the coding methods as described above.

The invention relates to a computer program comprising instructions for implementing decoding methods or encoding methods as described above, when they are executed by a processor.

Finally, the invention relates to a storage medium, readable by a processor, storing a computer program comprising instructions for carrying out the decoding methods or the encoding methods described above.

Other characteristics and advantages of the invention will emerge more clearly on reading the following description of particular embodiments, given by way of simple illustrative and non-limiting examples, and the accompanying drawings, among which:

[Fig 1] Figure 1 illustrates multi-mono coding according to the state of the art and as described above;

[Fig 2] Figure 2 illustrates in flowchart form the steps of a method for determining a set of corrections according to one embodiment of the invention;

[Fig 3] Figure 3 illustrates a first embodiment of an encoder and a decoder, an encoding method and a decoding method according to the invention;

[Fig 4] FIG. 4 illustrates a first detailed embodiment of the block for determining the set of corrections;

[Fig 5] FIG. 5 illustrates a second detailed embodiment of the block for determining the set of corrections;

[Fig 6] Figure 6 illustrates a second embodiment of an encoder and a decoder, a coding method and a decoding method according to the invention; and

[Fig 7] Figure 7 illustrates examples of structural embodiments of a coder and of a decoder according to one embodiment of the invention.

The method described below is based on the correction of spatial degradations, in particular to ensure that the spatial image of the decoded signal is as close as possible to the original signal. Unlike the parametric coding approaches known for stereo or multichannel signals, where perceptual attributes are coded, the invention is not based on a perceptual interpretation of spatial image information because the ambisonic domain is not directly “listenable”.

FIG. 2 represents the main steps implemented to determine a set of corrections to be applied to the encoded and then decoded multichannel signal.

The original multichannel signal B of dimension K x L (i.e. K components of L time or frequency samples) is input to the determination method. In step S1, information representative of a spatial image of the original multichannel signal is extracted.

We are interested here in the case of a multichannel signal in ambisonic representation, as described previously. The invention can also be applied to other types of multichannel signal, such as a B-format signal with modifications, for example the removal of certain components (e.g. removal of the R component at order 2 in order to keep only 8 channels) or the matrixing of the B-format to pass into an equivalent domain (called “Equivalent Spatial Domain”) as described in the specification 3GPP TS 26.260 - another example of matrixing is given by the “channel mapping 3” of the IETF Opus codec and in the 3GPP TR 26.918 specification (clause 6.1.6.3).

The distribution of sound energy from the ambisonic soundstage to different directions in space is referred to here as a "spatial image"; in variants, this spatial image describing the sound scene generally corresponds to positive quantities evaluated at different predetermined directions in space, for example in the form of a pseudo-spectrum of the MUSIC (Multiple Signal Classification) type sampled at these directions or a histogram of directions of arrival (where the directions of arrival are counted according to the discretization given by the predetermined directions); these positive quantities can be interpreted as energies and are seen as such hereafter to simplify the description of the invention.

A spatial image associated with an ambisonic sound scene therefore represents the relative sound energy (or more generally a positive quantity) as a function of different directions in space. In the invention, a piece of information representative of a spatial image can be, for example, a covariance matrix calculated between the channels of the multichannel signal, or else energy information associated with directions of origin of the sound (associated with directions of virtual loudspeakers distributed over a unit sphere). The set of corrections to be applied to a multichannel signal is a piece of information which can be defined by a set of gains associated with directions of origin of the sound, which can take the form of a correction matrix comprising this set of gains, or by a transformation matrix.

A covariance matrix of a multichannel signal B is for example obtained in step S1. As described later with reference to FIGS. 3 and 6, this matrix is for example calculated as follows:

C = B B^T up to a normalization factor (in the real case) or

C = Re(B B^H) up to a normalization factor (in the complex case)
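Both covariance formulas can be covered by one short routine, since Re(B B^H) reduces to B B^T when B is real. The sketch below assumes B is given as K rows of L samples and uses 1/L as the normalization factor; that factor is an assumption, as the text leaves it open, and the function name is hypothetical.

```python
def covariance(B, normalize=True):
    """Covariance C = Re(B B^H), optionally scaled by 1/L.

    B is a list of K channels, each a list of L real or complex samples.
    The 1/L scaling is an illustrative choice of normalization factor.
    """
    num_samples = len(B[0])
    scale = 1.0 / num_samples if normalize else 1.0
    return [
        [
            scale * sum((x * complex(y).conjugate()).real
                        for x, y in zip(row_i, row_j))
            for row_j in B
        ]
        for row_i in B
    ]
```

The result is a symmetric K x K matrix whose diagonal holds the channel energies and whose off-diagonal terms hold the inter-channel correlations.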

In variants, operations of temporal smoothing of the covariance matrix could be used. In the case of a multichannel signal in the time domain, the covariance can be estimated recursively (sample by sample) in the form:

C_ij(n) = n / (n + 1) C_ij(n-1) + 1 / (n + 1) b_i(n) b_j(n).
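The recursive estimate above can be sketched as follows, assuming the recursion starts at n = 0 from a zero matrix, so that after N samples the estimate equals the average of the outer products b(n) b(n)^T. The function name is hypothetical and the input is assumed to be a sequence of real-valued sample vectors of length K.

```python
def recursive_covariance(samples):
    """Sample-by-sample covariance estimate.

    Implements C_ij(n) = n/(n+1) * C_ij(n-1) + 1/(n+1) * b_i(n) * b_j(n),
    where samples[n] is the vector b(n) of K channel values at time n.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample vector is required")
    K = len(samples[0])
    C = [[0.0] * K for _ in range(K)]
    for n, b in enumerate(samples):
        keep = n / (n + 1)      # weight of the previous estimate
        new = 1.0 / (n + 1)     # weight of the current outer product
        C = [
            [keep * C[i][j] + new * b[i] * b[j] for j in range(K)]
            for i in range(K)
        ]
    return C
```

After the last sample this gives the same matrix as the batch formula B B^T divided by the number of samples, without storing the whole frame.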

In an alternative embodiment, energy information is obtained in different directions (associated with directions of virtual loudspeakers distributed over a unit sphere). For this, an SRP (for “Steered-Response Power”) type method described later with reference to FIGS. 3 and 4 could for example be applied. In variations, other spatial image computation methods (MUSIC pseudo-spectrum, arrival direction histogram) can be used.

Several embodiments are possible and described here to encode the original multichannel signal.

In a first embodiment, the various channels b_k, k = 0, ..., K-1, of B are coded, in step S2, by multi-mono coding, each channel b_k being coded separately. In alternative embodiments, multi-stereo coding where the channels b_k are coded in separate pairs is also possible. A typical example for a 5.1 input signal is to use two separate stereo encodings of L/R and Ls/Rs with mono encodings of C and LFE (low frequencies only); for the ambisonic case, the multi-stereo coding can be applied to the ambisonic components (B-format) or to an equivalent multichannel signal obtained after matrixing of the B-format channels - for example at order 1 the channels W, X, Y, Z can be converted to four transformed channels, and two pairs of channels are encoded separately and converted back to B-format on decoding. An example is given in recent versions of the Opus codec (“channel mapping 3”) and in specification 3GPP TR 26.918 (clause 6.1.6.3).

In other variants, it is also possible to use in step S2 a joint multichannel coding, such as for example the MPEG-H 3D Audio codec for the ambisonic (scene-based) format; in this case, the codec performs coding of the input channels jointly. In the MPEG-H example, this joint coding is broken down for an ambisonic signal into several steps such as the extraction and coding of predominant mono sources, the extraction of an ambience (typically reduced to an ambisonic signal of order 1), the coding of all the extracted channels (called “transport channels”) and of metadata describing the acoustic beamforming vectors for the extraction of predominant channels. Joint multichannel encoding makes it possible to exploit the relationships between all channels to, for example, extract predominant audio sources and ambience or perform global bit allocation taking into account all audio content.

In the preferred embodiment, step S2 is implemented as a multi-mono coding carried out using the 3GPP EVS codec as described above. However, the method according to the invention can be used independently of the core codec (multi-mono, multi-stereo, joint coding) used to represent the channels to be coded.

The signal thus encoded in the form of a bitstream can be decoded in step S3 either by a local decoder of the encoder, or by a decoder after transmission. The signal is decoded to recover the channels of the decoded multichannel signal B̂ (for example by several instances of the EVS decoder according to a multi-mono decoding).

Steps S2a, S2b, S3a, S3b represent an alternative embodiment of the encoding and decoding of the multichannel signal B. The difference with the encoding of step S2 described above lies in the use of additional processing operations for reducing the number of channels (“downmix”) in step S2a and increasing the number of channels (“upmix”) in step S3b. The encoding and decoding steps (S2b and S3a) are similar to steps S2 and S3 except that the number of respective input and output channels is lower in steps S2b and S3a. An example of a downmix for a first-order ambisonic input signal is to keep only the W channel; for an ambisonic input signal of order > 1, we can take as a downmix the first 4 components W, X, Y, Z (and therefore truncate the signal to order 1). In variants, we could take as a downmix a subset of the ambisonic components (for example 8 channels at order 2 without the R component) and also consider matrixing cases such as, for example, a stereo downmix obtained in the form: L = W - Y + 0.3 * X, R = W + Y + 0.3 * X (using only FOA channels).
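The two downmix examples above (truncation to order 1 and the stereo matrixing L = W - Y + 0.3 * X, R = W + Y + 0.3 * X) can be sketched as follows. The function names are hypothetical, channels are assumed to be lists of samples of equal length, and the truncation assumes that W, X, Y, Z are stored as the first four rows of the signal.

```python
def truncate_to_first_order(B):
    """Keep the first four components (assumed to be W, X, Y, Z)."""
    return B[:4]


def stereo_downmix(W, X, Y):
    """Stereo downmix from FOA channels: L = W - Y + 0.3 X, R = W + Y + 0.3 X."""
    left = [w - y + 0.3 * x for w, x, y in zip(W, X, Y)]
    right = [w + y + 0.3 * x for w, x, y in zip(W, X, Y)]
    return left, right


# One-sample example: the two outputs differ only through the Y component.
left, right = stereo_downmix([1.0], [1.0], [0.5])
print(left, right)
```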

An example of upmixing a mono signal consists of applying different spatial room impulse responses (SRIR for "Spatial Room Impulse Response") or different decorrelator filters (of the all-pass type) in the time or frequency domain. An exemplary embodiment of decorrelation in a frequency domain is given for example in document 3GPP S4-180975, pCR to 26.118 on Dolby VRStream audio profile candidate (clause X6.2.3.5).

The signal B* resulting from this “downmix” processing is coded in step S2b by a core codec (multi-mono, multi-stereo, joint coding), for example by a mono or multi-mono approach with the 3GPP EVS codec. The audio signal at the input of encoding step S2b and at the output of decoding step S3a has fewer channels than the original multichannel audio signal. In this case, the spatial image represented by the core codec is already significantly degraded even before the coding. In an extreme case, the number of channels is reduced to a single mono channel, by encoding only the W channel; the input signal is then limited to a single audio channel and the spatial image is therefore lost. The method according to the invention makes it possible to describe this spatial image and to reconstruct it as close as possible to that of the original multichannel signal.

At the output of the upmix step S3b of this variant embodiment, there is a decoded multichannel signal $\hat{B}$.

From the decoded multichannel signal $\hat{B}$ obtained according to either of the two variants (S2-S3 or S2a-S2b-S3a-S3b), information representative of the spatial image of the decoded multichannel signal is extracted in step S4. As for the original image, this information can be a covariance matrix calculated on the decoded multichannel signal or else energy information associated with directions of origin of the sound (or, equivalently, with virtual points on a unit sphere). The information representative of the original multichannel signal and of the decoded multichannel signal is used in step S5 to determine a set of corrections to be made to the decoded multichannel signal in order to limit the spatial degradations.

Two embodiments will be detailed below with reference to Figures 4 and 5 to illustrate this step.

The method described in FIG. 2 can be implemented in the time domain, in the full frequency band (with a single band) or by frequency sub-bands (with several bands); this does not change the operation of the method, each sub-band then being processed separately. If the method is carried out by sub-band, the set of corrections is determined per sub-band, which incurs an additional cost in computation and in data to be transmitted to the decoder compared to the case of a single band. The division into sub-bands can be uniform or non-uniform. For example, the spectrum of a signal sampled at 32 kHz can be divided according to different variants:

- 4 bands of respective width 1, 3, 4 and 8 kHz or 2, 2, 4, 8 kHz

- 24 Bark bands (100 Hz wide at low frequencies to 3.5-4 kHz for the last sub-band)

- the 24 Bark bands can optionally be grouped into blocks of 4 or 6 successive bands to form a set of respectively 6 or 4 “agglomerated” bands.
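The arithmetic behind these variants can be checked with a short sketch. The helper names below are hypothetical; the only assumption is that the band widths are stacked from 0 Hz up to the Nyquist frequency (16 kHz for a 32 kHz sampling rate).

```python
def band_edges(widths_khz):
    """Return cumulative band edges in kHz, starting at 0."""
    edges = [0]
    for w in widths_khz:
        edges.append(edges[-1] + w)
    return edges

def group_bands(n_bands, block_size):
    """Group band indices 0..n_bands-1 into blocks of successive bands."""
    return [list(range(i, i + block_size)) for i in range(0, n_bands, block_size)]

fs_khz = 32
nyquist_khz = fs_khz / 2

edges_a = band_edges([1, 3, 4, 8])   # widths 1, 3, 4, 8 kHz
edges_b = band_edges([2, 2, 4, 8])   # widths 2, 2, 4, 8 kHz
groups_of_4 = group_bands(24, 4)     # 24 Bark bands in blocks of 4
groups_of_6 = group_bands(24, 6)     # 24 Bark bands in blocks of 6
```

Both 4-band variants end exactly at the Nyquist frequency, and the groupings yield 6 and 4 agglomerated bands respectively.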

Other splits are possible (for example ERB bands - for "equivalent rectangular bandwidth" in English - or in 1/3 octave), including for the case of a different sampling frequency (for example 16 or 48 kHz).

In variants, the invention may also be implemented in a transform domain, for example in the domain of the short-term discrete Fourier transform (STFT) or the domain of the modified discrete cosine transform (MDCT).

Several embodiments are now described for implementing the determination of this set of corrections and for applying this set of corrections to the decoded signal.

We recall here the known technique of encoding a sound source in ambisonic format. A mono sound source can be artificially spatialized by multiplying its signal by the values of the spherical harmonics associated with its direction of origin (assuming the signal carried by a plane wave) to obtain as many ambisonic components. For this, we calculate the coefficients for each spherical harmonic for a position determined in azimuth Θ and in elevation φ to the desired order:

$B = Y(\theta, \varphi)\, s$

where s is the mono signal to spatialize and $Y(\theta, \varphi)$ is the encoding vector defining the coefficients of the spherical harmonics associated with the direction $(\theta, \varphi)$ for the order M. An example of an encoding vector is given below for order 1 with the SN3D convention and the SID or FuMa channel order (W, X, Y, Z):

$Y(\theta, \varphi) = [1,\ \cos\theta \cos\varphi,\ \sin\theta \cos\varphi,\ \sin\varphi]^T$

In variants, other normalization conventions (e.g. maxN, N3D) and channel orders (e.g. ACN) may be used, and the different embodiments are then adapted according to the convention used for the order or the normalization of the ambisonic components (FOA or HOA). This amounts to changing the order of the rows of $Y(\theta, \varphi)$ or multiplying these rows by predefined constants.

For higher orders, the coefficients $Y(\theta, \varphi)$ of the spherical harmonics can be found in the book by B. Rafaely, Fundamentals of Spherical Array Processing, Springer, 2015. In general, for an order M, the number of ambisonic signals is K = (M + 1)².
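The spatialization of a mono source recalled above can be sketched with NumPy. This is an illustration only, assuming the usual first-order SN3D encoding vector in the channel order W, X, Y, Z, with angles in radians; the function names and the test tone are hypothetical.

```python
import numpy as np

def encoding_vector_order1(azimuth, elevation):
    """First-order encoding vector, SN3D normalization, channel order W, X, Y, Z."""
    return np.array([
        1.0,
        np.cos(azimuth) * np.cos(elevation),
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
    ])

def encode_mono(signal, azimuth, elevation):
    """Spatialize a mono signal s as B = Y(theta, phi) . s (shape K x L)."""
    y = encoding_vector_order1(azimuth, elevation)
    return np.outer(y, signal)

def num_components(order):
    """Number of ambisonic components K = (M + 1)^2 for order M."""
    return (order + 1) ** 2

# One 20 ms frame at 32 kHz (640 samples) of a 440 Hz tone.
s = np.sin(2 * np.pi * 440 * np.arange(640) / 32000)
B = encode_mono(s, azimuth=np.pi / 2, elevation=0.0)  # source on the +Y axis
```

For a source at azimuth 90 degrees and zero elevation, only the W and Y channels carry the signal, which matches the encoding vector [1, 0, 1, 0].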

Likewise, we recall here some notions on ambisonic rendering or reproduction by loudspeakers. Ambisonic sound is not meant to be listened to as is; for immersive listening on loudspeakers or headphones, a step of “decoding” in the acoustic sense, also called rendering, must be carried out. We consider the case of N loudspeakers (virtual or physical) distributed over a sphere, typically of unit radius, whose directions $(\theta_n, \varphi_n)$, n = 0, ..., N-1, in terms of azimuth and elevation are known. Decoding, as considered here, is a linear operation which consists in applying a matrix D to the ambisonic signals B to obtain the loudspeaker signals $s_n$, which can be gathered into a matrix $S = [s_0, \dots, s_{N-1}]$, with $S = D B$. The matrix D can be decomposed into row vectors $d_n$; $d_n$ can be seen as a weighting vector for the nth loudspeaker, used to recombine the components of the ambisonic signal and calculate the signal played on the nth loudspeaker: $s_n = d_n B$.

There are many methods of "decoding" in the acoustic sense. The so-called “basic decoding” method, also called “mode-matching”, is based on the encoding matrix E associated with all the directions of the virtual loudspeakers:

$E = [Y(\theta_0, \varphi_0)\ \dots\ Y(\theta_{N-1}, \varphi_{N-1})]$

According to this method, the matrix D is typically defined as the pseudo-inverse of E:

$D = \mathrm{pinv}(E) = E^T (E E^T)^{-1}$

As an alternative, the method which can be called “projection” gives similar results for certain regular distributions of directions, and is described by the equation:

$D = \frac{1}{N} E^T$

In the latter case, we see that for each direction of index n, $d_n = \frac{1}{N} Y(\theta_n, \varphi_n)^T$.

In the context of this invention, such matrices serve as directional beamforming matrices describing how to obtain signals characteristic of directions of space in order to carry out a spatial analysis and/or spatial transformations. In the context of the present invention, it is useful to describe the reciprocal conversion for passing from the loudspeaker domain to the ambisonic domain. The successive application of the two conversions should exactly reproduce the original ambisonic signals if no intermediate modification is applied in the loudspeaker domain. We therefore define the reciprocal conversion as involving the pseudo-inverse of D: $B = \mathrm{pinv}(D)\, S$, where $\mathrm{pinv}(D) = (D^T D)^{-1} D^T$ for a matrix D of size N×K with full column rank.

When N = K = (M + 1)², the matrix D of size K×K is invertible under certain conditions and in this case: $B = D^{-1} S$

In the case of the "mode-matching" method, it appears that pinv(D) = E. In variants, other decoding methods D can be used, with the corresponding inverse conversion E; the only condition to check is that the combination of the decoding by D and the inverse conversion by E must give a perfect reconstruction (when no intermediate processing is carried out between the acoustic decoding and the acoustic encoding).

Such variants are for example given by:

- "mode-matching" decoding with a regularization term, in the form $E^T (E E^T + \varepsilon I)^{-1}$ where ε is a low value (for example 0.01),

- "in phase" or "max-rE" decoding known to the state of the art

- or variants where the distribution of the directions of the loudspeakers is not regular on the sphere.
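The decoding matrices and the perfect-reconstruction condition discussed above can be illustrated numerically. The sketch below is not the patent's implementation: it assumes order 1 (SN3D, channel order W, X, Y, Z) and a hypothetical layout of 6 virtual loudspeakers placed on the coordinate axes, chosen only because it makes $E E^T$ invertible.

```python
import numpy as np

def encoding_matrix(directions_deg):
    """Order-1 encoding matrix E (K x N): one SN3D column (W, X, Y, Z) per direction."""
    az = np.radians([d[0] for d in directions_deg])
    el = np.radians([d[1] for d in directions_deg])
    return np.vstack([
        np.ones_like(az),
        np.cos(az) * np.cos(el),
        np.sin(az) * np.cos(el),
        np.sin(el),
    ])

# Hypothetical layout: (azimuth, elevation) in degrees, 6 points on the axes.
directions = [(0, 0), (90, 0), (180, 0), (270, 0), (0, 90), (0, -90)]
E = encoding_matrix(directions)          # shape (4, 6)
N = E.shape[1]

D_mode = E.T @ np.linalg.inv(E @ E.T)    # mode-matching: D = pinv(E)
D_proj = E.T / N                         # projection:    D = E^T / N
eps = 0.01
D_reg = E.T @ np.linalg.inv(E @ E.T + eps * np.eye(E.shape[0]))  # regularized

# Decode to the loudspeaker domain, then convert back with pinv(D).
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 640))        # ambisonic frame, K x L
S = D_mode @ B                           # loudspeaker signals, N x L
B_back = np.linalg.pinv(D_mode) @ S      # reciprocal conversion
```

With the mode-matching matrix, the reciprocal conversion coincides with E and the round trip returns the original ambisonic frame; the regularized variant only approximates this.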

FIG. 3 represents a first embodiment of an encoding device and of a decoding device for the implementation of an encoding and decoding method including a method for determining a set of corrections as described with reference to FIG. 2.

In this embodiment, the encoder calculates information representative of the spatial image of the original multichannel signal and transmits it to the decoder to enable it to correct the spatial degradation caused by the encoding. This makes it possible, during decoding, to attenuate spatial artefacts in the decoded ambisonic signal.

Thus, the encoder receives a multichannel input signal, for example in an FOA or HOA ambisonic representation, or in a hybrid representation with a subset of ambisonic components up to a given partial ambisonic order. The latter case is in fact covered in an equivalent way by the FOA or HOA case, where the missing ambisonic components are zero and the ambisonic order is given by the minimum order required to include all the defined components. Thus, without loss of generality, the FOA or HOA cases are considered in the remainder of the description.

In the embodiment thus described, the input signal is sampled at 32 kHz. The encoder operates in frames which are preferably 20 ms in length, ie L = 640 samples per frame at 32 kHz. In variations, other frame lengths and sampling rates are possible (eg L = 480 samples per frame of 10 msec at 48 kHz).

In a preferred embodiment, the coding is performed in the time domain (on one or more bands); however, in variants, the invention can be implemented in a transformed domain, for example after a short-term discrete Fourier transform (STFT) or a modified discrete cosine transform (MDCT).

According to the coding embodiment used, as explained with reference to FIG. 2, a block 310 for reducing the number of channels (DMX) can be implemented; the input of block 311 is the signal B* at the output of block 310 when the downmix is implemented, or the signal B otherwise. In one embodiment, if the downmix is applied, it consists, for example, for an ambisonic input signal of order 1, in keeping only the W channel and, for an ambisonic input signal of order > 1, in keeping only the first 4 ambisonic components W, X, Y, Z (therefore truncating the signal to order 1). Other types of downmix (such as those described above with a selection of a subset of channels and/or matrixing) can be implemented without modifying the method according to the invention.

Block 311 encodes the audio signals b'_k of B* at the output of block 310 in the case where the downmix step is performed, or the audio signals b_k of the original multichannel signal B. These signals correspond to the ambisonic components of the original multichannel signal if no channel count reduction processing has been applied.

In a preferred embodiment, block 311 uses multi-mono coding (COD) with fixed or variable allocation, where the core codec is the standardized 3GPP EVS codec. In this multi-mono approach, each channel b_k or b'_k is coded separately by an instance of the codec; however, in variants, other coding methods are possible, for example multi-stereo coding or joint multichannel coding. Therefore, at the output of this coding block 311, an encoded audio signal originating from the original multichannel signal is obtained, in the form of a bitstream which is sent to the multiplexer 340.

Optionally, block 320 performs a sub-band division. In variants, this division into sub-bands could reuse equivalent processing operations carried out in blocks 310 or 311; the separation of block 320 is here functional. In a preferred embodiment, the channels of the original multichannel audio signal are divided into 4 frequency sub-bands of respective width 1 kHz, 3 kHz, 4 kHz, 8 kHz (which amounts to dividing the frequencies into the ranges 0-1000, 1000-4000, 4000-8000 and 8000-16000 Hz). This division can be implemented by means of a short-term discrete Fourier transform (STFT), band-pass filtering in the Fourier domain (by application of a frequency mask), and inverse transform with overlap-add. In this case, the sub-bands remain sampled at the original frequency and the processing according to the invention is applied in the time domain; in variants, it is possible to use a critically sampled filter bank. It will be noted that the sub-band division generally involves a processing delay which depends on the type of filter bank used; according to the invention, a time alignment can be applied before or after encoding-decoding and/or before the extraction of the spatial image information, so that the spatial image information is well synchronized in time with the corrected signal.
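The frequency-mask idea can be sketched on a single frame. This is a simplified illustration, not the patent's filter bank: it masks the bins of one real FFT without any windowing or overlap-add, and the function name and test tones are hypothetical. Each sub-band stays in the time domain at the original sampling rate.

```python
import numpy as np

def split_subbands(x, fs, edges_hz):
    """Split a real frame x into len(edges_hz) - 1 time-domain sub-bands
    by masking the bins of its real FFT (no windowing or overlap-add here)."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    n_bands = len(edges_hz) - 1
    # Band index of each bin; the Nyquist bin is assigned to the last band.
    band_of_bin = np.minimum(
        np.searchsorted(edges_hz, freqs, side="right") - 1, n_bands - 1
    )
    bands = []
    for b in range(n_bands):
        masked = np.where(band_of_bin == b, spectrum, 0.0)
        bands.append(np.fft.irfft(masked, n=n))
    return np.array(bands)

fs = 32000
edges = [0, 1000, 4000, 8000, 16000]
t = np.arange(640) / fs                      # one 20 ms frame
x = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
sub = split_subbands(x, fs, edges)           # shape (4, 640)
```

Because the masks partition the frequency bins, the sub-bands sum back to the input frame; a real implementation would add the windowing, overlap-add and delay compensation mentioned above.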

In variants, full-band processing may be carried out, or the sub-band cutting may be different as explained previously.

In other variations, the signal from a transform of the original multichannel audio signal is directly used and the invention is applied in the transformed domain with subband slicing in the transformed domain.

In the remainder of the description, the various stages of coding and decoding are described as if the processing were carried out in the time or frequency domain (real or complex) with a single frequency band, in order to simplify the description.

It is also possible to implement, optionally, in each sub-band, a high-pass filtering, for example in the form of an elliptic IIR filter of order 2 whose cut-off frequency is typically set at 20 or 50 Hz. This preprocessing avoids a potential bias in the subsequent estimation of the covariance during coding; without this preprocessing, the correction implemented in block 390 described later will tend to amplify the low frequencies during full-band processing.

Block 321 determines (Inf. B) information representative of a spatial image of the original multichannel signal.

In one embodiment, this information is energy information associated with directions of origin of sound (associated with directions of virtual speakers distributed over a unit sphere).

To do this, we define a virtual 3D sphere of unit radius; this 3D sphere is discretized by N points (“point” virtual loudspeakers) whose position is defined in spherical coordinates by the directions $(\theta_n, \varphi_n)$ for the nth loudspeaker. The loudspeakers are typically placed (almost) uniformly on the sphere. The number N of virtual loudspeakers is determined as a discretization having at least N = K points, with M the ambisonic order of the signal and K = (M + 1)², i.e. N ≥ K. A “Lebedev” type quadrature method can for example be used to perform this discretization, according to the references V. I. Lebedev and D. N. Laikov, “A quadrature formula for the sphere of the 131st algebraic order of accuracy”, Doklady Mathematics, vol. 59, no. 3, 1999, pp. 477-481, or Pierre Lecomte, Philippe-Aubert Gauthier, Christophe Langrenne, Alexandre Garcia and Alain Berry, On the use of a Lebedev grid for Ambisonics, AES Convention 139, New York, 2015.

In variants, other discretizations can be used, such as for example a Fliege discretization with at least N = K points (N ≥ K), as described in the reference J. Fliege and U. Maier, “A two-stage approach for computing cubature formulae for the sphere”, Technical Report, Dortmund University, 1999, or a discretization taking the points of a “spherical t-design” as described in the article by R. H. Hardin and N. J. A. Sloane, “McLaren's Improved Snub Cube and Other New Spherical Designs in Three Dimensions”, Discrete and Computational Geometry, 15 (1996), pp. 429-441.

From this discretization, it is possible to determine the spatial image of the multichannel signal. One possible method is for example the SRP method (for "Steered-Response Power"). This method consists in calculating the short-term energy coming from different directions defined in terms of azimuth and elevation. For this, as explained previously, similarly to rendering on N loudspeakers, a weighting matrix of the ambisonic components is calculated, then this matrix is applied to the multichannel signal to sum the contributions of the components and produce a set of N acoustic beams (or "beamformers").

The signal from the acoustic beam for the direction $(\theta_n, \varphi_n)$ of the nth loudspeaker is given by: $s_n = d_n B$, where $d_n$ is the weighting (row) vector giving the acoustic beamforming coefficients for the given direction and B is a matrix of size K×L representing the ambisonic signal (B-format) with K components, over a time interval of length L.

The set of signals from the N acoustic beams leads to the equation: $S = D B$, where D is the matrix of size N×K whose rows are the vectors $d_n$, and S is a matrix of size N×L representing the signals of the N virtual loudspeakers over a time interval of length L.

The short-term energy on the time segment of length L for each direction $(\theta_n, \varphi_n)$ is:

$\sigma_n^2 = s_n s_n^T = d_n B B^T d_n^T = d_n C d_n^T$

where $C = B B^T$ (real case) or $C = \mathrm{Re}(B B^H)$ (complex case) is the covariance matrix of B.

Each term $\sigma_n^2 = s_n s_n^T$ can thus be calculated for the set of directions $(\theta_n, \varphi_n)$ which correspond to a discretization of the 3D sphere by virtual loudspeakers.

The spatial image Σ is then given by:

$\Sigma = [\sigma_0^2, \dots, \sigma_{N-1}^2]$
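The SRP computation can be sketched as follows, assuming the real case with an unnormalized covariance $C = B B^T$. The beamforming matrix D is a random placeholder here (any N×K matrix works for the identity being checked), and the function name is hypothetical.

```python
import numpy as np

def spatial_image_srp(C, D):
    """Energies sigma_n^2 = d_n C d_n^T for every row d_n of D (length-N vector)."""
    return np.einsum("nk,kj,nj->n", D, C, D)

rng = np.random.default_rng(1)
K, N, L = 4, 6, 640
B = rng.standard_normal((K, L))         # ambisonic frame (K x L)
D = rng.standard_normal((N, K))         # placeholder beamforming matrix (N x K)

C = B @ B.T                             # covariance, real case, unnormalized
image = spatial_image_srp(C, D)         # spatial image, one energy per direction

# Same quantity computed explicitly from the beam signals S = D B.
S = D @ B
image_explicit = np.sum(S ** 2, axis=1)
```

Both routes give the same N energies, which is why the spatial image can be derived from the covariance matrix alone, without forming the beam signals.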

Other variations of the calculation of a spatial image ∑ than the SRP method, can be used.

- The d _n values may vary depending on the type of acoustic beam forming used (delay-sum, MVDR, LCMV, etc.). The invention also applies to these variant calculations of the matrix D and of the spatial image.

- The MUSIC method (MUltiple SIgnal Classification) also provides another way of calculating a spatial image, with a subspace approach. The invention also applies to this variant of calculation of the spatial image, which corresponds to the MUSIC pseudo-spectrum calculated by diagonalizing the covariance matrix and evaluated for the directions $(\theta_n, \varphi_n)$.

- The spatial image can be calculated from a histogram of the intensity vector (at order 1) as for example in the article by S. Tervo, Direction estimation based on sound intensity vectors, Proc. EUSIPCO, 2009, or its generalization into a pseudo-intensity vector. In this case, the histogram (whose values are the number of occurrences of values of arrival directions according to the predetermined directions $(\theta_n, \varphi_n)$) is interpreted as a set of energies according to the predetermined directions.

Block 330 then quantizes the spatial image thus determined, for example with a 16-bit scalar quantization per coefficient (directly using the floating-point representation truncated to 16 bits). In variants, other scalar or vector quantization methods are possible. In another embodiment, the information representative of the spatial image of the original multichannel signal is a covariance matrix (of the sub-bands) of the input channels B. This matrix is calculated as:

$C = B B^T$ up to a normalization factor (in the real case).

If the invention is implemented in a complex-valued transform domain, this covariance is calculated as: $C = \mathrm{Re}(B B^H)$ up to a normalization factor.

In variants, operations of temporal smoothing of the covariance matrix could be used. In the case of a multichannel signal in the time domain, the covariance can be estimated recursively (sample by sample).

The covariance matrix C (of size K×K) being, by definition, symmetric, only one of the lower or upper triangles is transmitted to the quantization block 330, which codes (Q) K(K + 1)/2 coefficients, K being the number of ambisonic components. This block 330 quantizes these coefficients, for example with a 16-bit scalar quantization per coefficient (directly using the floating-point representation truncated to 16 bits). In variants, other methods of scalar or vector quantization of the covariance matrix can be implemented. For example, the maximum value (maximum variance) of the covariance matrix can be calculated, and the values of the upper (or lower) triangle of the covariance matrix normalized by this maximum value can then be coded by scalar quantization with a logarithmic step on a lower number of bits (for example 8 bits).
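The triangle packing can be sketched as below. This is only an illustration: NumPy's IEEE half-precision type `float16` is used as a stand-in for the "16-bit truncated floating point representation", which is an assumption (the exact bit layout is not specified in the text), and the function names are hypothetical.

```python
import numpy as np

def pack_upper_triangle(C):
    """Return the K(K+1)/2 coefficients of the upper triangle, as 16-bit floats."""
    rows, cols = np.triu_indices(C.shape[0])
    return C[rows, cols].astype(np.float16)

def unpack_upper_triangle(coeffs, K):
    """Rebuild the full symmetric K x K matrix from its upper triangle."""
    rows, cols = np.triu_indices(K)
    C = np.zeros((K, K))
    C[rows, cols] = coeffs
    C[cols, rows] = coeffs                # mirror into the lower triangle
    return C

rng = np.random.default_rng(2)
K = 4
B = rng.standard_normal((K, 640))
C = B @ B.T / 640                         # normalized so values fit the float16 range
coeffs = pack_upper_triangle(C)           # what would be sent to the multiplexer
C_rec = unpack_upper_triangle(coeffs, K)  # what the decoder would rebuild
```

For K = 4 ambisonic components, 10 coefficients are coded instead of 16, and the rebuilt matrix is symmetric by construction.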

In variants, the covariance matrix C could be regularized before quantization in the form C + εI.

The quantized values are sent to multiplexer 340.

In this embodiment, the decoder receives in the demultiplexer block 350, a bit stream comprising an encoded audio signal from the original multichannel signal and information representative of a spatial image of the original multichannel signal.

Block 360 decodes ($Q^{-1}$) the covariance matrix or other information representative of the spatial image of the original signal. Block 370 decodes (DEC) the audio signal as represented by the bit stream.

In one embodiment of the encoding and decoding, not implementing the downmix and upmix steps, the decoded multichannel signal is obtained at the output of decoding block 370.

In the embodiment where the downmix step has been used in encoding, the decoding implemented in block 370 provides a decoded audio signal which is input to upmix block 371.

Thus, block 371 implements an optional step (UPMIX) of increasing the number of channels. In one embodiment of this step, for the channel of a mono signal, it consists in convolving the signal with different spatial room impulse responses (SRIR for “Spatial Room Impulse Response”); these SRIRs are defined at the original ambisonic order of B. Other decorrelation methods are possible, for example the application of all-pass decorrelator filters to the different channels of the signal.

The block 372 implements an optional step (SB) of division into sub-bands to obtain either sub-bands in the time domain or in a transformed domain. A reverse step, in block 391, groups the sub-bands to find a multichannel signal at the output.

Block 375 determines (Inf. $\hat{B}$) information representative of a spatial image of the decoded multichannel signal in a manner similar to that described for block 321 (for the original multichannel signal), this time applied to the decoded multichannel signal obtained at the output of block 371 or block 370 depending on the decoding embodiment.

In the same way as described for block 321, in one embodiment, this information is energy information associated with directions of origin of the sound (associated with the directions of virtual loudspeakers distributed over a unit sphere). As explained above, an SRP (or other) type method can be used to determine the spatial image of the decoded multichannel signal. In another embodiment, this information is a covariance matrix of the channels of the decoded multichannel signal. This covariance matrix is then obtained as follows:

$\hat{C} = \hat{B} \hat{B}^T$ (real case) or $\hat{C} = \mathrm{Re}(\hat{B} \hat{B}^H)$ (complex case), up to a normalization factor.

From the information representative of the spatial images of the original multichannel signal (Inf. B) and of the decoded multichannel signal (Inf. $\hat{B}$) respectively, for example the covariance matrices C and $\hat{C}$, block 380 implements the method of determination (Det.Corr) of a set of corrections as described with reference to FIG. 2.

Two particular embodiments of this determination are described with reference to FIGS. 4 and 5.

In the embodiment of FIG. 4, a method using rendering (explicit or not) on virtual loudspeakers is used, and in the embodiment of FIG. 5, a method based on a Cholesky-type factorization is used.

Block 390 of Figure 3 implements a correction (CORR) of the multichannel signal decoded by the set of corrections determined by block 380 to obtain a corrected decoded multichannel signal.

FIG. 4 therefore represents an embodiment of the step of determining a set of corrections. This embodiment is accomplished through the use of virtual speaker rendering.

In this embodiment, it is initially considered that the information representative of the spatial image of the original multichannel signal and of the decoded multichannel signal are the respective covariance matrices C and $\hat{C}$.

In this case, blocks 420 and 421 respectively determine the spatial images of the original multichannel signal and the decoded multichannel signal.

To do this, as was done previously, we discretize a virtual 3D sphere of unit radius, by N points (“point” virtual speakers) whose direction is defined in spherical coordinates by the directions (Θ _n , φ _n ) for the nth speaker.

Several discretization methods have been defined above.

From this discretization, we can determine the spatial image of the multichannel signal. As previously mentioned, one possible method is the SRP (or other) method which consists in calculating the short-term energy coming from different directions defined in terms of azimuth and elevation.

This method or other types of methods as listed previously can be used to determine the spatial images Σ and $\hat{\Sigma}$ respectively of the original multichannel signal, in 420 (IMG B), and of the decoded multichannel signal, in 421 (IMG $\hat{B}$). In the case where the information representative of the spatial image of the original signal (Inf. B) received and decoded in 360 by the decoder is the spatial image itself, that is to say energy information (or a positive quantity) associated with directions of origin of the sound (associated with directions of virtual loudspeakers distributed over a unit sphere), it is then no longer necessary to calculate it at 420. This spatial image is then used directly by block 430 described below.

Likewise, if the information representative of the spatial image of the decoded multichannel signal (Inf. $\hat{B}$) determined at 375 is the spatial image itself of the decoded multichannel signal, then it is no longer necessary to calculate it at 421. This spatial image is then used directly by block 430 described below.

From the spatial images Σ and $\hat{\Sigma}$, block 430 calculates (Ratio), for each point given by $(\theta_n, \varphi_n)$, the energy ratio between the energy $\sigma_n^2 = \Sigma_n$ of the original signal and the energy $\hat{\sigma}_n^2 = \hat{\Sigma}_n$ of the decoded signal. A set of gains $g_n$ is thus obtained according to the following equation:

$g_n = \sqrt{\sigma_n^2 / \hat{\sigma}_n^2}$

The energy ratio, depending on the direction $(\theta_n, \varphi_n)$ and the frequency band, can be very large. Block 440 optionally makes it possible to limit (Limit $g_n$) the maximum value that a gain $g_n$ can take. It is recalled here that the positive quantities denoted $\sigma_n^2$ and $\hat{\sigma}_n^2$ can correspond more generally to quantities resulting from a MUSIC pseudo-spectrum or to values resulting from a histogram of directions of arrival according to the discretized directions $(\theta_n, \varphi_n)$.

In one possible embodiment, a threshold is applied to the value of $g_n$. Any value greater than this threshold is forced to be equal to this threshold value. The threshold can be fixed for example at 6 dB, so that a gain value outside the range ±6 dB is saturated to ±6 dB.

This set of gains $g_n$ therefore constitutes the set of corrections to be made to the decoded multichannel signal. This set of gains is received at the input of the correction block 390 of FIG. 3. A correction matrix directly applicable to the decoded multichannel signal can be defined, for example in the form $G = E \cdot \mathrm{diag}([g_0 \dots g_{N-1}]) \cdot D$, where D and E are the acoustic decoding and encoding matrices defined previously. This matrix G is applied to the decoded multichannel signal to obtain the corrected output ambisonic signal (corr).
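The gain computation, the limiting step and the correction matrix can be sketched together. Several points are assumptions made for illustration: the gain is taken as the square root of the energy ratio, the 6 dB threshold is interpreted as an amplitude gain (20·log10), a small floor protects against division by zero, and the 6-loudspeaker layout is hypothetical.

```python
import numpy as np

def correction_matrix(image_orig, image_dec, E, D, limit_db=6.0, floor=1e-12):
    """Build G = E . diag(g) . D with g_n = sqrt(sigma_n^2 / sigma_hat_n^2),
    each gain being saturated to +/- limit_db (amplitude dB, an assumption)."""
    g = np.sqrt(image_orig / np.maximum(image_dec, floor))  # floor avoids /0
    g_max = 10.0 ** (limit_db / 20.0)
    g = np.clip(g, 1.0 / g_max, g_max)
    return E @ np.diag(g) @ D, g

# Hypothetical layout: 6 virtual loudspeakers on the axes, order 1 (W, X, Y, Z).
E = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, -1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, -1.0],
])
D = np.linalg.pinv(E)                     # mode-matching decoding matrix

image_orig = np.array([2.25, 1.0, 1.0, 1.0, 1.0, 100.0])  # original energies
image_dec = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])      # decoded energies
G, g = correction_matrix(image_orig, image_dec, E, D)
G_identity, _ = correction_matrix(image_dec, image_dec, E, D)
```

When the two spatial images are identical, all gains equal 1 and G reduces to the identity, so the decoded signal is left unchanged; the large ratio in the last direction is saturated at the threshold.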

A breakdown of the steps implemented for the correction is now described. Block 390 applies, for each virtual loudspeaker, the corresponding gain $g_n$ determined previously. The application of this gain makes it possible to obtain, on this loudspeaker, the same energy as the original signal.

The rendering on each loudspeaker of the decoded signals is thus corrected.

An acoustic encoding step, for example ambisonic encoding by the matrix E, is then implemented to obtain components of the multichannel signal, for example ambisonic components. These ambisonic components are finally summed to obtain the multichannel output signal, corrected (Corr). It is therefore possible to calculate explicitly the channels associated with the virtual loudspeakers, to apply a gain to them, then to recombine the processed channels, or in an equivalent manner to apply the matrix G to the signal to be corrected.

In variants, from the covariance matrix $\hat{C}$ of the encoded then decoded multichannel signal and from the correction matrix G, the covariance matrix of the corrected signal can be calculated in block 390 as:

$R = G \hat{C} G^T$

Only the value of the first coefficient $R_{00}$ of the matrix R, corresponding to the omnidirectional component (channel W), is kept, to be applied as a normalization factor and to avoid an increase in the overall gain due to the correction matrix G:

$G_{norm} = g_{norm} \cdot G$, with $g_{norm} = \sqrt{\hat{C}_{00} / R_{00}}$

where $\hat{C}_{00}$ corresponds to the first coefficient of the covariance matrix of the decoded multichannel signal.

In variants, the normalization factor $g_{norm}$ can be determined without calculating the entire matrix R, because it suffices to calculate only a subset of matrix elements to determine $R_{00}$ (and therefore $g_{norm}$).

The matrix G or $G_{norm}$ thus obtained corresponds to the set of corrections to be made to the decoded multichannel signal.
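The normalization can be sketched as follows. The sketch assumes that the normalization factor is the square root of the ratio between the first coefficient of the decoded covariance and $R_{00}$, so that the energy of the W channel is unchanged by the correction; this form, the function name and the placeholder matrix G are assumptions made for illustration.

```python
import numpy as np

def normalize_correction(G, C_dec):
    """Scale G so that the corrected W-channel energy equals the decoded one."""
    g_row = G[0, :]                        # only the first row of G is needed
    r00 = g_row @ C_dec @ g_row            # R_00 of R = G C_dec G^T
    g_norm = np.sqrt(C_dec[0, 0] / r00)
    return g_norm * G, g_norm

rng = np.random.default_rng(3)
B_dec = rng.standard_normal((4, 640))      # stand-in for a decoded frame
C_dec = B_dec @ B_dec.T                    # its covariance (real case)
G = np.eye(4) + 0.2 * rng.standard_normal((4, 4))   # placeholder correction matrix

G_norm, g_norm = normalize_correction(G, C_dec)
R_norm = G_norm @ C_dec @ G_norm.T         # covariance after normalized correction
```

Only the first row of G enters the computation of $R_{00}$, which illustrates why the full matrix R does not have to be formed.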

Figure 5 now shows another embodiment of the method for determining the set of corrections implemented in block 380 of Figure 3.

In this embodiment, it is considered that the information representative of the spatial image of the original multichannel signal and of the decoded multichannel signal are the respective covariance matrices C and $\hat{C}$.

In this embodiment, no attempt is made to render on virtual speakers in order to correct the spatial image of a multichannel signal. In particular, for an ambisonic signal, we seek to calculate the correction of the spatial image directly in the ambisonic domain.

For this, a transformation matrix T to be applied to the decoded signal is determined, so that the spatial image modified after application of the transformation matrix T to the decoded signal $\hat{B}$ is the same as that of the original signal B. We are therefore looking for a matrix T which satisfies the following equation:

$T \hat{C} T^T = C$

where $C = B B^T$ is the covariance matrix of B and $\hat{C} = \hat{B} \hat{B}^T$ is the covariance matrix of $\hat{B}$, in the current frame.

In this embodiment, a factorization known as Cholesky factorization is used to solve this equation.

Given a matrix A of size n×n, the Cholesky factorization consists in determining a triangular matrix L (lower or upper) such that A = L L^T (real case) or A = L L^H (complex case). For the decomposition to be possible, the matrix A must be a symmetric positive definite matrix (real case) or a Hermitian positive definite matrix (complex case); in the real case, the diagonal coefficients of L are strictly positive.

In the real case, a matrix M of size n×n is said to be symmetric positive definite if it is symmetric (M^T = M) and positive definite (x^T M x > 0 for all nonzero $x \in \mathbb{R}^n$).

For a symmetric matrix M, it is possible to check that the matrix is positive definite by verifying that all its eigenvalues are strictly positive (λ_i > 0). If the eigenvalues are only non-negative (λ_i ≥ 0), the matrix is said to be positive semi-definite.

A matrix M of size n×n is said to be Hermitian positive definite if it is Hermitian (M^H = M) and positive definite (z^H M z is a real number > 0 for all nonzero $z \in \mathbb{C}^n$).

Cholesky factorization is for example used to find a solution to a system of linear equations of the type Ax = b. For example, in the complex case, it is possible to transform A into LL ^H by the Cholesky factorization, to solve Ly = b then to solve L ^H x = y.
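The two-step resolution can be sketched with NumPy in the complex case. The matrix A is built to be Hermitian positive definite by construction; `np.linalg.solve` is a general-purpose solver used here for brevity, whereas a dedicated triangular solver would exploit the structure of L.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = M @ M.conj().T + n * np.eye(n)        # Hermitian positive definite by construction
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

L = np.linalg.cholesky(A)                 # lower triangular, A = L L^H
y = np.linalg.solve(L, b)                 # forward step:  L y = b
x = np.linalg.solve(L.conj().T, y)        # backward step: L^H x = y
```

The vector x obtained after the two triangular steps satisfies A x = b.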

Equivalently, the Cholesky factorization can be written as A = U ^T U (real case) and A = U ^H U (complex case), where U is an upper triangular matrix.

In the embodiment described here, without loss of generality, we only deal with the case of a Cholesky factorization by triangular matrix L.

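Thus, the Cholesky factorization makes it possible to decompose a matrix C = LL ^T into two triangular matrices, on the condition that the matrix C is symmetric positive definite. Applying it to both covariance matrices, $C = L L^T$ and $\hat{C} = \hat{L} \hat{L}^T$, gives the following equation:

$T \hat{L} \hat{L}^T T^T = L L^T$

By identification, we find:

$T \hat{L} = L$

That is:

$T = L \hat{L}^{-1}$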

Since the covariance matrices C and $\hat{C}$ are generally positive semi-definite matrices, the Cholesky factorization cannot be used as is. We note here that when the matrices L and $\hat{L}$ are lower triangular (respectively upper triangular), the transformation matrix T is also lower triangular (respectively upper triangular).

Thus, block 510 forces the covariance matrix C to be positive definite (Fact. C, for factorization of C). For this, a value ε is added to the coefficients of the diagonal of the matrix to guarantee that the matrix is indeed positive definite: C = C + εI, where ε is a small value, set for example at 10 ^-9, and I is the identity matrix. Similarly, block 520 forces the covariance matrix $\hat{C}$ to be positive definite, by modifying this matrix in the form $\hat{C} = \hat{C} + \varepsilon I$, where ε is a small value, set for example at 10 ^-9, and I is the identity matrix.

Once the two covariance matrices C and $\hat{C}$ are conditioned to be positive definite, block 530 calculates the associated Cholesky factorizations and finds (Det.T) the optimal transformation matrix T in the form $T = L \hat{L}^{-1}$.
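As a minimal sketch of the processing of blocks 510 to 530, assuming that the solution takes the form $T = L \hat{L}^{-1}$ with lower triangular factors, the computation could look as follows in Python with NumPy. The function name, the frame length of 960 samples and the synthetic signals are assumptions made for the example only.

```python
import numpy as np

def correction_matrix_cholesky(C, C_hat, eps=1e-9):
    """Return T such that T (C_hat + eps I) T^T = C + eps I."""
    I = np.eye(C.shape[0])
    L = np.linalg.cholesky(C + eps * I)          # block 510: regularize, factorize C
    L_hat = np.linalg.cholesky(C_hat + eps * I)  # block 520: same for the decoded signal
    # Block 530: T = L L_hat^-1, obtained by solving L_hat^T T^T = L^T
    # rather than forming an explicit inverse.
    return np.linalg.solve(L_hat.T, L.T).T

# Synthetic first-order ambisonic frame (K = 4 channels), for illustration only
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 960))                     # "original" signal
B_hat = 0.5 * B + 0.1 * rng.standard_normal((4, 960)) # "decoded" signal
C = B @ B.T
C_hat = B_hat @ B_hat.T
T = correction_matrix_cholesky(C, C_hat)
```

With these inputs, T is lower triangular, and applying it to the decoded frame (T @ B_hat) yields a signal whose covariance matches that of the original frame up to the regularization term.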

In variants, an alternative resolution can be made with an eigenvalue decomposition.

The decomposition into eigenvalues ("eigen decomposition" in English) consists in factoring a real or complex matrix A of size n x n in the form:

A = Q Λ Q ^-1, where Λ is a diagonal matrix containing the eigenvalues λ _i and Q is the matrix of eigenvectors.

If the matrix is real and symmetric:

A = Q Λ Q ^T

In the complex (Hermitian) case, the decomposition is written: A = Q Λ Q ^H

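In the present case, one then seeks a matrix T such that $T \hat{C} T^T = C$, where $C = Q \Lambda Q^T$ and $\hat{C} = \hat{Q} \hat{\Lambda} \hat{Q}^T$, that is:

$T \hat{Q} \hat{\Lambda} \hat{Q}^T T^T = Q \Lambda Q^T$

By identification we find:

$T \hat{Q} \sqrt{\hat{\Lambda}} = Q \sqrt{\Lambda}$

That is:

$T = Q \sqrt{\Lambda} \sqrt{\hat{\Lambda}}^{-1} \hat{Q}^T$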
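The eigenvalue-based variant can be sketched in the same way, assuming that the solution takes the form $T = Q \sqrt{\Lambda} \sqrt{\hat{\Lambda}}^{-1} \hat{Q}^T$ for real symmetric covariance matrices. The function name, the clamping of the eigenvalues and the synthetic signals are assumptions of this example, not elements of the description.

```python
import numpy as np

def correction_matrix_eig(C, C_hat, eps=1e-9):
    """Return T = Q sqrt(Lambda) sqrt(Lambda_hat)^-1 Q_hat^T."""
    lam, Q = np.linalg.eigh(C)              # C = Q diag(lam) Q^T
    lam_hat, Q_hat = np.linalg.eigh(C_hat)  # C_hat = Q_hat diag(lam_hat) Q_hat^T
    lam = np.maximum(lam, 0.0)              # assumption: clip rounding-induced negatives
    lam_hat = np.maximum(lam_hat, eps)      # assumption: avoid division by zero
    d = np.sqrt(lam / lam_hat)              # diagonal of sqrt(Lambda) sqrt(Lambda_hat)^-1
    return (Q * d) @ Q_hat.T                # Q * d scales the columns of Q

# Synthetic K = 4 channel frame, for illustration only
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 960))
B_hat = 0.5 * B + 0.1 * rng.standard_normal((4, 960))
C = B @ B.T
C_hat = B_hat @ B_hat.T
T_eig = correction_matrix_eig(C, C_hat)
```

Since the ordering and sign of eigenvectors returned by a numerical routine can change from one frame to the next, this sketch makes no attempt to stabilize T over time.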

The stability of the solution from one frame to another is typically poorer than with a Cholesky factorization approach. In addition to this instability, the eigenvalue decomposition may introduce larger numerical approximation errors.

In variants, the calculation of the diagonal matrix can be done element by element, using a sign function sgn (.) (+1 if positive, -1 otherwise) and a regularization term ε (e.g. ε = 10 ^-9 ) to avoid divisions by zero.

In this embodiment, the relative difference in energy between the decoded ambisonic signal and the corrected ambisonic signal may be very large, especially at high frequencies, which can be strongly degraded by encoders such as multi-mono EVS coding. To avoid excessively amplifying certain frequency zones, a regularization term can be added. Block 640 optionally takes care of normalizing (Norm. T) this correction.

In the preferred embodiment, a normalization factor is therefore calculated so as not to amplify frequency zones. From the covariance matrix $\hat{C}$ of the encoded then decoded multichannel signal and from the transformation matrix T, we can calculate the covariance matrix of the corrected signal as:

$R = T \hat{C} T^T$

Only the value of the first coefficient R ₀₀ of the matrix R, corresponding to the omnidirectional component (channel W), is kept to be applied as a normalization factor to T and to avoid an increase in the overall gain due to the correction matrix T: $T_{norm} = g_{norm} T$ with $g_{norm} = \sqrt{\hat{C}_{00} / R_{00}}$, where $\hat{C}_{00}$ corresponds to the first coefficient of the covariance matrix of the decoded multichannel signal.

In variants, the normalization factor g _norm can be determined without calculating the entire matrix R, because it suffices to calculate only a subset of matrix elements to determine R ₀₀ (and therefore g _norm ).
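The shortcut mentioned above can be illustrated by the following sketch, which assumes a normalization factor of the form $g_{norm} = \sqrt{\hat{C}_{00} / R_{00}}$ and computes R ₀₀ from the first row of T only. The function name and the small matrices are hypothetical values chosen for the example.

```python
import numpy as np

def normalize_correction(T, C_hat):
    """Scale T so that the corrected W channel keeps the decoded W energy."""
    t0 = T[0, :]                    # first row of T
    r00 = t0 @ C_hat @ t0           # R_00 of R = T C_hat T^T, without forming R
    g_norm = np.sqrt(C_hat[0, 0] / r00)
    return g_norm * T, g_norm

# Illustrative 2 x 2 values (not taken from the description)
C_hat = np.array([[4.0, 1.0], [1.0, 2.0]])
T = np.array([[2.0, 0.0], [0.5, 1.5]])   # lower triangular, as with Cholesky
T_norm, g_norm = normalize_correction(T, C_hat)
R_norm = T_norm @ C_hat @ T_norm.T
```

When T is lower triangular, R ₀₀ reduces to the square of the first diagonal coefficient of T multiplied by the first coefficient of the decoded covariance matrix, so the first diagonal coefficient of the normalized matrix equals 1.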

The matrix T or T _norm thus obtained corresponds to the set of corrections to be made to the decoded multichannel signal.

With this embodiment, block 390 of FIG. 3 performs the step of correcting the decoded multichannel signal by applying the transformation matrix T or T _norm directly to the decoded multichannel signal, in the ambisonic domain, to obtain the corrected ambisonic output signal (Corr).

A second embodiment of an encoder / decoder according to the invention will now be described, in which the method for determining the set of corrections is implemented at the encoder. Figure 6 describes this embodiment. This figure therefore represents a second embodiment of an encoding device and of a decoding device for implementing a coding and decoding method including a method for determining a set of corrections as described with reference to FIG. 2.

In this embodiment, the method of determining the set of corrections (for example the gains associated with the directions) is carried out at the encoder, which then transmits this set of corrections to the decoder. The decoder decodes this set of corrections to apply it to the decoded multichannel signal. This embodiment therefore involves implementing local decoding at the encoder; this local decoding is represented by blocks 612 to 613.

The blocks 610, 611, 620 and 621 are identical respectively to the blocks 310, 311, 320 and 321 described with reference to FIG. 3.

Thus, at the output of block 621, information representative of the spatial image of the original multichannel signal (Inf. B) is obtained.

Block 612 implements local decoding (DEC_loc) in connection with the coding performed by block 611.

The local decoding can consist of a complete decoding from the bit stream coming from the block 611 or, preferably, it can be integrated into the block 611.

In one embodiment of the encoding and decoding, not implementing the downmix and upmix steps, the decoded multichannel signal is obtained at the output of local decoding block 612.

In the embodiment where the downmix step in 610 has been used for encoding, the local decoding implemented in block 612 makes it possible to obtain a decoded audio signal which is sent as input to block 613 of upmix.

Thus, block 613 implements an optional step (UPMIX) of increasing the number of channels. In one embodiment of this step, for the channel of a mono signal, it consists in convolving the signal with different spatial room impulse responses (SRIR for “Spatial Room Impulse Response”); these SRIRs are defined in the original ambisonic order of B. Other decorrelation methods are possible, for example the application of all-pass decorrelator filters to the different channels of the signal.

The block 614 implements an optional step (SB) of division into sub-bands to obtain sub-bands either in the time domain or in a transformed domain. Block 615 determines (Inf) information representative of a spatial image of the decoded multichannel signal, similarly to what has been described for blocks 621 and 321 (for the original multichannel signal), applied this time to the decoded multichannel signal obtained at the output of block 612 or of block 613, depending on the local decoding embodiment. This block 615 is equivalent to block 375 of FIG. 3.

In the same way as for blocks 621 and 321, in one embodiment, this information is energy information associated with directions of origin of sound (associated with directions of virtual speakers distributed over a unit sphere). As explained above, an SRP-type method or another method (such as the variants described above) can be used to determine the spatial image of the decoded multichannel signal.

In another embodiment, this information is a covariance matrix of the channels of the decoded multichannel signal. This covariance matrix is then obtained as follows: $\hat{C} = \hat{B} \hat{B}^T$ up to a normalization factor (in the real case), or $\hat{C} = \hat{B} \hat{B}^H$ up to a normalization factor (in the complex case).
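For illustration, the covariance matrix of a frame can be computed as in the following sketch, where the frame is assumed to be stored as a K x N array (channels by samples or frequency bins). The function name and the optional division by N are assumptions of the example; the description only specifies the product up to a normalization factor.

```python
import numpy as np

def covariance(frame, normalize=False):
    """Covariance of a K x N frame: B B^T (real case) or B B^H (complex case)."""
    cov = frame @ frame.conj().T     # conj() is a no-op for real input
    if normalize:
        cov = cov / frame.shape[1]   # one possible normalization factor
    return cov

# Complex example, e.g. K = 4 channels over 64 bins of one sub-band (illustrative)
rng = np.random.default_rng(2)
B_hat = rng.standard_normal((4, 64)) + 1j * rng.standard_normal((4, 64))
C_hat = covariance(B_hat)
```

The resulting matrix is Hermitian and positive semi-definite by construction.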

From the information representative of the spatial images respectively of the original multichannel signal (Inf. B) and of the decoded multichannel signal (Inf. $\hat{B}$), for example the covariance matrices C and $\hat{C}$, block 630 implements the method for determining (Det.Corr) a set of corrections as described with reference to FIG. 2.

Two particular embodiments of this determination are possible and have been described with reference to FIGS. 4 and 5.

In the embodiment of FIG. 4, a method using speaker rendering is used and in the embodiment of FIG. 5, a method implemented directly in the ambisonic domain based on a factorization of the Cholesky type or by eigenvalue decomposition is used.

Thus, if the embodiment of FIG. 4 is applied at 630, the determined set of corrections is a set of gains g _n for a set of directions (Θ _n , φ _n ) defined by a set of virtual loudspeakers. This set of gains can be determined in the form of a correction matrix G as described with reference to FIG. 4.

This set of gains (Corr.) is then coded at 640. The coding of this set of gains can consist in coding the correction matrix G or G _norm .

It is noted that the matrix G of size K x K is symmetric, so according to the invention it is possible to code only the lower or upper triangle of G or G _norm , i.e. K x (K + 1) / 2 values. In general, the values on the diagonal are positive. In one embodiment, the coding of the matrix G or G _norm is carried out by scalar quantization, with or without a sign bit depending on whether or not the values are off the diagonal. In the variants where G _norm is used, we can omit coding and transmitting the first value of the diagonal (corresponding to the omnidirectional component) of G _norm because it is always equal to 1; for example, in the first-order ambisonic case with K = 4 channels, this amounts to transmitting only 9 values instead of K x (K + 1) / 2 = 10 values. In variants, other scalar or vector quantization methods (with or without prediction) could be used.
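The count of transmitted values can be checked with the following sketch, which extracts the lower triangle of a symmetric K x K matrix and rebuilds the full matrix from it. The function names and the test matrix are hypothetical, and the quantization itself is deliberately left out.

```python
import numpy as np

def pack_lower_triangle(G, skip_first=False):
    """Return the K(K+1)/2 lower-triangle values, optionally without G[0, 0]."""
    rows, cols = np.tril_indices(G.shape[0])   # row-major order, starts at (0, 0)
    values = G[rows, cols]
    return values[1:] if skip_first else values

def unpack_symmetric(values, K, first=None):
    """Rebuild a symmetric K x K matrix; 'first' restores an omitted G[0, 0]."""
    if first is not None:
        values = np.concatenate(([first], values))
    G = np.zeros((K, K))
    rows, cols = np.tril_indices(K)
    G[rows, cols] = values
    return G + np.tril(G, -1).T                # mirror the strict lower triangle

# First-order ambisonic case, K = 4 (illustrative matrix)
K = 4
rng = np.random.default_rng(3)
A = rng.standard_normal((K, K))
G_norm = A + A.T
G_norm[0, 0] = 1.0                             # first diagonal value fixed to 1
packed_all = pack_lower_triangle(G_norm)
packed = pack_lower_triangle(G_norm, skip_first=True)
restored = unpack_symmetric(packed, K, first=1.0)
```

Here packed_all holds 10 values and packed holds 9, in line with the figures given above for K = 4.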

If the embodiment of FIG. 5 is applied at 630, the determined set of corrections is a transformation matrix T or T _norm which is then coded at 640.

It is noted that the matrix T of size KxK is triangular in the variant using Cholesky factorization and symmetric in the variant using the eigenvalue decomposition; thus according to the invention it is possible to code only the lower or upper triangle of T or T _norm , ie Kx (K + 1) / 2 values.

In general, the values on the diagonal are positive. In one embodiment, the coding of the T or T _norm matrix is performed by scalar quantization, with or without a sign bit depending on whether or not the values are off the diagonal. In variants, other scalar or vector quantization methods (with or without prediction) could be used. In the variants where T _norm is used, we can omit coding and transmitting the first value of the diagonal (corresponding to the omnidirectional component) of T _norm because it is always equal to 1; for example, in the first-order ambisonic case with K = 4 channels, this amounts to transmitting only 9 values instead of K x (K + 1) / 2 = 10 values.

Block 640 thus encodes the determined set of corrections and sends the encoded set of corrections to multiplexer 650.

The decoder receives in the demultiplexer block 660, a bit stream comprising an encoded audio signal from the original multichannel signal and the encoded set of corrections to be applied to the decoded multichannel signal.

Block 670 decodes (Q ^-1 ) the encoded set of corrections. Block 680 decodes (DEC) the encoded audio signal received in the stream.

In one embodiment of the encoding and decoding, not implementing the downmix and upmix steps, the decoded multichannel signal is obtained at the output of decoding block 680.

In the embodiment where the downmix step has been used in encoding, the decoding implemented in block 680 provides a decoded audio signal which is input to upmix block 681.

Thus, block 681 implements an optional step (UPMIX) of increasing the number of channels. In one embodiment of this step, for the channel of a mono signal, it consists in convolving the signal with different spatial room impulse responses (SRIR for “Spatial Room Impulse Response”); these SRIRs are defined in the original ambisonic order of B. Other decorrelation methods are possible, for example the application of all-pass decorrelator filters to the different channels of the signal.

The block 682 implements an optional step (SB) of division into sub-bands to obtain sub-bands either in the time domain or in a transformed domain, and the block 691 groups the sub-bands to recover the output multichannel signal.

Block 690 implements a correction (CORR) of the multi-channel signal decoded by the set of corrections decoded at block 670 to obtain a corrected decoded multi-channel signal (Corr). In one embodiment where the set of corrections is a set of gains as described with reference to FIG. 4, this set of gains is received at the input of the correction block 690.

If the set of gains is in the form of a correction matrix directly applicable to the decoded multichannel signal, defined for example in the form G = E.diag ([g ₀ ... g _N-1 ]).D or G _norm = g _norm .G, this matrix G or G _norm is then applied to the decoded multichannel signal to obtain the corrected ambisonic output signal (Corr).

If the block 690 receives a set of gains g _n , the block 690 applies, for each virtual loudspeaker, the corresponding gain g _n . The application of this gain makes it possible to obtain, on this loudspeaker, the same energy as the original signal.

The rendering on each loudspeaker of the decoded signals is thus corrected.

An acoustic encoding step, for example ambisonic encoding, is then implemented to obtain components of the multichannel signal, for example ambisonic components. These ambisonic components are then summed to obtain the multichannel output signal, corrected (Corr).
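The two forms of correction described for block 690 (per-loudspeaker gains followed by re-encoding, or a single correction matrix of the form G = E.diag(g).D) can be compared with the following sketch. The decoding matrix D, the encoding matrix E and the gains are random placeholders chosen for the example; real matrices would be derived from the directions of the virtual loudspeakers.

```python
import numpy as np

def correct_via_loudspeakers(B_hat, D, E, g):
    """Decode to N virtual loudspeakers, apply gain g_n to each, re-encode."""
    S = D @ B_hat          # acoustic decoding: N loudspeaker signals
    S = g[:, None] * S     # gain g_n applied to loudspeaker n
    return E @ S           # acoustic encoding and summation of the components

# Placeholder dimensions: K = 4 ambisonic channels, N = 6 virtual loudspeakers
rng = np.random.default_rng(4)
K, N = 4, 6
D = rng.standard_normal((N, K))        # hypothetical decoding matrix
E = rng.standard_normal((K, N))        # hypothetical encoding matrix
g = rng.uniform(0.5, 2.0, size=N)      # hypothetical gains
B_hat = rng.standard_normal((K, 960))  # decoded multichannel frame

out_steps = correct_via_loudspeakers(B_hat, D, E, g)
G = E @ np.diag(g) @ D                 # equivalent K x K correction matrix
out_matrix = G @ B_hat
```

Both paths give the same output, which is why transmitting or precomputing the K x K matrix G can replace the explicit loudspeaker rendering.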

In an embodiment where the set of corrections is a transformation matrix as described with reference to FIG. 5, the transformation matrix T decoded at 670 is received at the input of the correction block 690.

With this embodiment, block 690 performs the step of correcting the decoded multichannel signal by applying the T or T _norm transformation matrix directly to the decoded multichannel signal, in the ambisonic domain, to obtain the corrected ambisonic output signal (Corr).

Even if the invention applies to the ambisonic case, in variants it is possible to convert other formats (multichannel, object, etc.) into ambisonic in order to apply the methods implemented according to the various embodiments described. An exemplary embodiment of such a conversion from a multichannel or object format to an ambisonic format is described in figure 2 of the 3GPP TS 26.259 (v15.0.0) specification.

FIG. 7 shows a coding device DCOD and a decoding device DDEC within the meaning of the invention, these devices being dual to each other (in the sense of “reversible”) and connected to each other by a communication network RES. The coding device DCOD comprises a processing circuit typically including:

- a memory MEM1 for storing instruction data of a computer program within the meaning of the invention (these instructions can be distributed between the DCOD encoder and the DDEC decoder);

- an interface INT1 for receiving an original multichannel signal B, for example an ambisonic signal distributed over different channels (for example four channels W, Y, Z, X at order 1) with a view to its coding in compression within the meaning of the invention;

- a processor PROC1 for receiving this signal and processing it by executing the computer program instructions stored in the memory MEM1, with a view to its coding; and

- a communication interface COM1 for transmitting the coded signals via the network.

The DDEC decoding device comprises its own processing circuit, typically including:

- a memory MEM2 for storing instruction data of a computer program within the meaning of the invention (these instructions can be distributed between the DCOD encoder and the DDEC decoder as indicated above);

- a COM2 interface for receiving the encoded signals from the RES network with a view to their compression decoding within the meaning of the invention;

- a processor PROC2 for processing these signals by executing the computer program instructions stored in the memory MEM2, with a view to their decoding; and

- an output interface INT2 to deliver the corrected decoded signals (Corr) for example in the form of ambisonic channels W..X, with a view to their reproduction.

Of course, this FIG. 7 illustrates an example of a structural embodiment of a codec (encoder or decoder) within the meaning of the invention. Figures 3 to 6, commented on above, describe in detail more functional embodiments of these codecs.

Claims

1. Method of determining a set of corrections (Corr.) to be made to a multichannel sound signal, in which the set of corrections is determined from information representative of a spatial image of an original multichannel signal (Inf. B) and information representative of a spatial image of the original multichannel signal encoded then decoded (Inf. $\hat{B}$).

2. The method of claim 1, wherein the determination of the set of corrections is performed by frequency sub-band.

3. A method of decoding a multichannel sound signal, comprising the following steps:

- reception (350) of a binary stream comprising an encoded audio signal from an original multichannel signal and information representative of a spatial image of the original multichannel signal;

- decoding (370) of the received encoded audio signal and obtaining a decoded multichannel signal;

- decoding (360) of information representative of a spatial image of the original multichannel signal;

- determination (375) of information representative of a spatial image of the decoded multichannel signal;

- determination (380) of a set of corrections to be made to the decoded signal according to the determination method according to one of claims 1 to 2;

- correction (390) of the multichannel signal decoded by the determined set of corrections.

4. Method for encoding a multichannel sound signal, comprising the following steps:

- encoding (611) of an audio signal from an original multichannel signal;

- determination (621) of information representative of a spatial image of the original multichannel signal;

- local decoding (612) of the encoded audio signal and obtaining a decoded multichannel signal;

- determination (615) of information representative of a spatial image of the decoded multichannel signal;

- determination (630) of a set of corrections to be made to the decoded multichannel signal according to the determination method according to one of claims 1 to 2;

- coding (640) of the determined set of corrections.

5. A decoding method according to claim 3 or an encoding method according to claim 4, in which the information representative of a spatial image is a covariance matrix and the determination of the set of corrections further comprises the following steps:

- determining a spatial image of the original multichannel signal from the weighting matrix obtained and from the covariance matrix of the original multichannel signal;

6. The decoding method according to claim 3, wherein the information representative of a spatial image of the original multichannel signal received is the spatial image of the original multichannel signal and the determination of the set of corrections further comprises the following steps:

7. The decoding method according to claim 3 or the encoding method according to claim 4, in which the information representative of a spatial image is a covariance matrix and the determination of the set of corrections comprises a step of determining a transformation matrix by matrix decomposition of the two covariance matrices, the transformation matrix constituting the set of corrections.

8. The decoding method according to one of claims 5 to 7, wherein the correction of the multi-channel signal decoded by the determined set of corrections is performed by applying all of the corrections to the decoded multi-channel signal.

9. Decoding method according to one of claims 5 to 6, wherein the correction of the multichannel signal decoded by the determined set of corrections is carried out according to the following steps:

- acoustic decoding of the decoded multichannel signal on the defined set of virtual loudspeakers;

10. A method of decoding a multichannel sound signal, comprising the following steps:

- reception of a binary stream comprising an encoded audio signal originating from an original multichannel signal and a coded set of corrections to be made to the decoded multichannel signal, the set of corrections having been coded according to a coding method according to one of claims 4, 5 or 7;

- decoding of the coded set of corrections;

11. Multichannel sound signal decoding method, comprising the following steps:

- reception of a binary stream comprising an encoded audio signal originating from an original multichannel signal and a coded set of corrections to be made to the decoded multichannel signal, the set of corrections having been coded according to a coding method according to claim 5;

- decoding of the received coded audio signal and obtaining a decoded multichannel signal;

- decoding of the coded set of corrections;

- acoustic decoding of the decoded multichannel signal on the set of virtual loudspeakers;

12. Decoding device comprising a processing circuit for implementing the decoding method according to one of claims 3 or 5 to 11.

13. Coding device comprising a processing circuit for implementing the coding method according to one of claims 4, 5 or 7.

14. Storage medium, readable by a processor, storing a computer program comprising instructions for the execution of the decoding method according to one of claims 3 or 5 to 11 or of the encoding method according to one of claims 4, 5 or 7.