EP3025514A1

EP3025514A1 - Sound spatialization with room effect

Info

Publication number: EP3025514A1
Application number: EP14748239.2A
Authority: EP
Inventors: Grégory PALLONE; Marc Emerit
Original assignee: Orange SA
Current assignee: Orange SA
Priority date: 2013-07-24
Filing date: 2014-07-04
Publication date: 2016-06-01
Anticipated expiration: 2034-07-04
Also published as: FR3009158A1; KR20210008952A; US20160174013A1; KR102206572B1; KR102310859B1; WO2015011359A1; ES2754245T3; CN105684465B; JP6486351B2; JP2016527815A; EP3025514B1; CN105684465A; KR20160034942A; US9848274B2

Abstract

The invention relates to a sound spatialization method in which at least one filtering process, including summation, is applied to at least two input signals (I(1), I(2), ..., I(L)), the filtering process comprising: the application of at least one first room effect transfer function (A^k(1), A^k(2), ..., A^k(L)), the first transfer function being specific to each input signal; and the application of at least one second room effect transfer function (B_mean^k), the second transfer function being common to all input signals. The method comprises a step of weighting at least one input signal with a weighting factor (W^k(l)), said weighting factor being specific to each of the input signals.

Description

Sound Spatialization with Room Effect

The invention relates to the processing of sound data, and more particularly to the spatialization (called "3D rendering") of audio signals.

Such an operation is performed, for example, when decoding a coded 3D audio signal represented on a certain number of channels into a different number of channels, for example two, so that the 3D audio effects can be reproduced over headphones.

The invention also relates to the transmission and reproduction of multichannel audio signals and to their conversion for a rendering device (a transducer) imposed by the user's equipment. This is the case, for example, when a 5.1 sound scene is reproduced over headphones or over a pair of loudspeakers.

The invention also relates to the rendering, in the context of a game or video recording, for example, of one or more sound samples stored in files, with a view to their spatialization.

In the case of a static monophonic source, binauralization is based on filtering the monophonic signal with the transfer function between the desired position of the source and each of the two ears. The binaural signal obtained (two channels) can then feed headphones and give the listener the sensation of a source at the simulated position. Thus, the term "binaural" refers to the reproduction of a sound signal with spatialization effects.

Each of the transfer functions simulating different positions can be measured in an anechoic chamber, resulting in a set of HRTFs ("Head Related Transfer Functions") in which no room effect is present.

These transfer functions can also be measured in a "standard" room, resulting in a set of BRIRs ("Binaural Room Impulse Responses") in which the room effect, or reverberation, is present. The set of BRIRs thus corresponds to a set of transfer functions between a given position and the ears of a listener (real or artificial head) placed in a room.

The usual BRIR measurement technique consists of successively sending a test signal (for example a sweep signal, a pseudo-random binary sequence or white noise) to each of the real loudspeakers positioned around a head (real or artificial) equipped with microphones in the ears. This test signal makes it possible, in non-real-time processing, to reconstruct (generally by deconvolution) the impulse response between the position of the loudspeaker and each of the two ears. The difference between a set of HRTFs and a set of BRIRs lies mainly in the length of the impulse response, which is of the order of one millisecond for HRTFs and of the order of one second for BRIRs.

Since the filtering is based on the convolution between the monophonic signal and the impulse response, the complexity of binauralizing with BRIRs (containing a room effect) is much higher than with HRTFs.

With this technique, it is possible to simulate, over headphones or with a limited number of loudspeakers, listening to multichannel content (L channels) reproduced by L loudspeakers in a room. Indeed, it suffices to consider each of the L loudspeakers as a virtual source ideally positioned relative to the listener, to measure, in the room to be simulated, the transfer functions (for the left and right ears) of each of these L loudspeakers, and then to apply to each of the L audio signals (supposed to feed the L real loudspeakers) the BRIR filters corresponding to the loudspeakers. The signals feeding each ear are summed to provide a binaural signal feeding headphones.

We denote by I(l) (with l = [1, L]) the input signal supposed to feed the L loudspeakers. We denote by BRIR^{g/d}(l) the BRIRs of each of the loudspeakers for each of the two ears (g for the left ear, d for the right ear), and by O^{g/d} the binaural output signal. The binauralization of the multichannel signal is therefore written:

$$O^{g} = \sum_{l=1}^{L} I(l) * BRIR^{g}(l)$$

$$O^{d} = \sum_{l=1}^{L} I(l) * BRIR^{d}(l)$$

where * represents the convolution operator.

In what follows, the index l, with l ∈ [1, L], refers to one of the L loudspeakers. There is one BRIR for each signal l.
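As an illustration of the two sums above, here is a minimal sketch in plain Python. The helper names `convolve` and `binauralize` and the toy values are hypothetical and are not taken from the patent; real BRIRs would contain tens of thousands of samples.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def binauralize(inputs, brirs):
    """Sum over l of I(l) * BRIR(l) for one ear.

    inputs: list of L input signals of equal length.
    brirs:  list of L impulse responses (equal length) for that ear.
    """
    out = [0.0] * (len(inputs[0]) + len(brirs[0]) - 1)
    for signal, brir in zip(inputs, brirs):
        for n, v in enumerate(convolve(signal, brir)):
            out[n] += v
    return out


# Toy example with L = 2 loudspeakers (made-up values).
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
brir_left = [[0.5, 0.25], [0.2, 0.1]]
brir_right = [[0.2, 0.1], [0.5, 0.25]]
O_left = binauralize(I, brir_left)    # left-ear output O^g
O_right = binauralize(I, brir_right)  # right-ear output O^d
```

Each ear needs L convolutions here, which is the 2·L cost that the rest of the description sets out to reduce.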

Thus, with reference to FIG. 1, two convolutions (one for each ear) are performed for each loudspeaker (steps S11 to S1L). For L loudspeakers, binauralization therefore requires 2·L convolutions. The complexity C_conv can be calculated for a fast block-based implementation, for example one using a Fast Fourier Transform (FFT). The document "Submission and Evaluation Procedures for 3D Audio" specifies a possible formula for calculating C_conv:

$$C_{conv} = (L + 2) \cdot nBlocs \cdot \left(6 \cdot \log_2\left(\frac{2 \cdot Fs}{nBlocs}\right)\right)$$

In this equation, L represents the number of FFTs needed to transform the input signals to the frequency domain (one FFT per input signal), the first 2 represents the number of inverse FFTs needed to obtain the time-domain binaural signal (two inverse FFTs for the two binaural channels), the 6 is a complexity coefficient per FFT, the second 2 accounts for the zero padding needed to avoid the problems due to circular convolution, Fs is the size of each of the BRIRs, nBlocs reflects the use of block processing, which is more realistic when latency must not be excessively high, and "·" denotes multiplication.

Thus, for a typical use with nBlocs = 10, Fs = 48000 and L = 22, the complexity per multichannel signal sample of a direct FFT-based convolution is C_conv = 19049 multiply-add operations.
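As a quick check of the figure quoted above, the sketch below evaluates the complexity formula, read here as C_conv = (L + 2)·nBlocs·(6·log2(2·Fs/nBlocs)). This reading of the garbled formula and the function name `fft_convolution_complexity` are assumptions made for illustration.

```python
import math


def fft_convolution_complexity(L, n_blocs, Fs):
    """C_conv = (L + 2) . nBlocs . (6 . log2(2 . Fs / nBlocs))."""
    return (L + 2) * n_blocs * (6 * math.log2(2 * Fs / n_blocs))


c_conv = fft_convolution_complexity(L=22, n_blocs=10, Fs=48000)
print(int(c_conv))  # 19049
```

With these values the formula reproduces the 19049 operations stated in the text, which supports this reading of the equation.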

This complexity is too high for a realistic implementation on today's processors (mobile processors, for example). It is therefore necessary to reduce this complexity without greatly degrading the binaural rendering. For the spatialization to be of good quality, the entire time signal of the BRIRs must be applied.

The present invention improves the situation.

It aims to greatly reduce the complexity of binauralizing a multichannel signal with room effect while maintaining the best audio quality.

To this end, the present invention proposes a sound spatialization method in which at least one filtering, with summation, is applied to at least two input signals (I(1), I(2), ..., I(L)), the filtering comprising: the application of at least one first room effect transfer function (A^k(1), A^k(2), ..., A^k(L)), this first transfer function being specific to each input signal; and the application of at least one second room effect transfer function (B_mean^k), this second transfer function being common to all the input signals. The method comprises a step of weighting at least one input signal with a weighting factor (W^k(l)), said weighting factor being specific to each of the input signals. The input signals correspond, for example, to the different channels of a multichannel signal. Such filtering can in particular deliver at least two output signals intended for spatialized rendering (binaural or transaural, or surround-type rendering involving more than two output signals). In a particular embodiment, the filtering delivers exactly two output signals, the first being spatialized for the left ear and the second for the right ear. This makes it possible to preserve the natural degree of correlation that can exist between the left and right ears at low frequencies.

The physical properties (for example the energy or the correlation between the various transfer functions) of the transfer functions over certain time intervals make simplifications possible. At these intervals, the transfer functions can be approximated by an average filter.

The application of the room effect transfer functions is therefore advantageously compartmentalized over these intervals. At least one first transfer function specific to each input signal may be applied for intervals where it is not possible to make approximations. At least one second transfer function approximated to an average filter may be applied for intervals where it is possible to make approximations.

The application of a single transfer function common to all the input signals substantially reduces the number of calculations to be performed for the spatialization. The complexity of this spatialization is therefore advantageously reduced. This simplification reduces the processing time while limiting the load on the processor or processors used for these calculations. Moreover, with weighting factors specific to each of the input signals, the energy differences between the different input signals can be taken into account even if the processing applied to them is partly approximated by an average filter.

In a particular embodiment, the first and second transfer functions are respectively representative of:

- direct sound propagation and the first sound reflections of this propagation; and
- a diffuse sound field present after these first reflections,

and the method within the meaning of the invention furthermore comprises: the application of first transfer functions respectively specific to the input signals, and the application of a second transfer function, identical for all the input signals and resulting from an overall approximation of a diffuse sound field effect. The complexity of the processing is thus advantageously reduced by this approximation. In addition, the influence of such an approximation on the quality of the processing is limited because the approximation concerns the diffuse sound field effects and not the direct sound propagation. These diffuse sound field effects are indeed less sensitive to approximations. The first sound reflections are typically a first succession of echoes of the sound wave. In an exemplary practical embodiment, these first reflections are considered to be at most two in number.

In another embodiment, a preliminary step of constructing the first and second transfer functions from impulse responses incorporating a room effect comprises, for the construction of a first transfer function, the following operations:

- determining a start time of the presence of direct sound waves,
- determining a start time of the presence of the diffuse sound field after the first reflections, and
- selecting, in an impulse response, the portion of the response that extends from the start time of the presence of direct sound waves to the start time of the presence of the diffuse field, the selected portion corresponding to the first transfer function.

In a particular embodiment, the start time of the presence of the diffuse field is determined from predetermined criteria. In a possible embodiment, the detection of a monotonic decay of the spectral density of sound power in a given room can typically characterize the onset of the diffuse field, and hence give its start time.

In a variant, the moment of beginning of presence can be determined by an estimate according to the characteristics of the room, for example simply from the volume of the room as will be seen later.

As a variant, in a simpler exemplary embodiment, it can be considered that if an impulse response extends over N samples, then the start time of the presence of the diffuse field occurs, for example, after N/2 samples of the impulse response. The start time is thus predetermined and corresponds to a fixed value. Typically, this value may correspond, for example, to the 2048th sample out of the 48000 samples of an impulse response incorporating a room effect.

The instant of onset of the presence of direct sound waves, mentioned above, may correspond, for example, to the beginning of the time signal of an impulse response with room effect. In a complementary embodiment, the second transfer function is constructed from a set of impulse response portions beginning temporally after the start time of the presence of the diffuse field.

In a variant, the second transfer function can be determined from the characteristics of the room, or from predetermined standard filters.

Thus, the impulse responses incorporating a room effect are advantageously divided into two parts separated by a start time of presence. Such a separation makes it possible to apply processing adapted to each of these parts. For example, the first samples (the first 2048) of an impulse response can be selected for use as the first transfer function in the filtering, and the remaining samples (from 2048 to 48000, for example) can then be ignored or averaged with those of other impulse responses.

The advantage of such an embodiment is then to simplify the filtering calculations specific to the input signals, and to add a form of noise arising from the sound diffusion. This noise can be calculated from the second halves of the impulse responses (in the form of an average, for example, as will be seen later), or simply from a predetermined impulse response estimated as a function of the characteristics of the given room (its volume, the coverings of its walls, or others), or of a standard room.

In another variant, the second transfer function is given by applying a formula of the type:

$$B_{mean}^{k} = \frac{1}{L} \sum_{l=1}^{L} B_{norm}^{k}(l)$$

with k the index relating to an output signal, l ∈ [1, L] the index relating to an input signal, L the number of input signals, and B_norm^k(l) a normalized transfer function obtained from a set of impulse response portions beginning temporally after the start time of the presence of the diffuse field.

In one embodiment, the first and second transfer functions are derived from a plurality of binaural room impulse responses (BRIRs). In another embodiment, these first and second transfer functions are obtained from experimental values derived from measurements of propagation and reverberation in a given room. The processing is thus carried out from experimental data. Such data reflect the room effects very precisely and thus guarantee highly realistic rendering.

In another embodiment, the first and second transfer functions are obtained from reference filters, synthesized for example with a feedback delay network.

In one embodiment, truncation is applied at the beginning of the BRIRs. Thus, the first samples of BRIR for which the application to the input signals has no influence are advantageously eliminated.

In another particular embodiment, a BRIR start truncation compensation delay is applied. This compensation time makes it possible to compensate for the time offset introduced by the truncation.

In another embodiment, truncation is applied at the end of BRIR. Thus, the last samples of BRIR for which the application to the input signals has no influence are advantageously eliminated.

In one embodiment, the filtering comprises the application of at least one compensation delay corresponding to a time difference between the aforementioned instant of the start of direct sound waves and the start time of presence of diffuse field. Thus, the delays that can be introduced by the application of time-shifted transfer functions are advantageously compensated.

In another embodiment, the first and second room effect transfer functions are applied in parallel to the input signals. In addition, at least one compensation delay is applied to the input signals filtered by the second transfer functions. These two transfer functions can thus be processed simultaneously for each of the input signals. Such processing advantageously reduces the processing time for the implementation of the present invention.

In a particular embodiment, an energy compensation gain is applied to the weighting factors. At least one energy compensation gain is thus applied to at least one input signal, and the output amplitude is advantageously normalized. This energy compensation gain makes it possible to preserve the energy of the binauralized signals: it corrects the energy of the binauralized signals according to the degree of correlation of the input signals.

In a particular embodiment, the energy compensation gain is a function of the correlation between the input signals. Thus, the correlation between signals is advantageously taken into account.

In one embodiment, at least one output signal is given by applying a formula of the type:

$$O^{k} = \sum_{l=1}^{L} \left( I(l) * A^{k}(l) \right) + z^{-iDD} \cdot \left( \sum_{l=1}^{L} \frac{1}{W^{k}(l)} \cdot I(l) \right) * B_{mean}^{k}$$

with:

- k the index relating to an output signal,
- O^k an output signal,
- l ∈ [1, L] the index relating to an input signal among the input signals,
- L the number of input signals,
- I(l) an input signal among the input signals,
- A^k(l) a room effect transfer function among the first room effect transfer functions,
- B_mean^k a room effect transfer function among the second room effect transfer functions,
- W^k(l) a weighting factor among the weighting factors,
- z^{-iDD} corresponding to the application of the compensation delay,

where · is multiplication and * is the convolution operator.

In another embodiment, a decorrelation step is applied to the input signals prior to the application of the second transfer functions. In this embodiment, at least one output signal is thus obtained by applying a formula of the type:

$$O^{k} = \sum_{l=1}^{L} \left( I(l) * A^{k}(l) \right) + z^{-iDD} \cdot \left( \sum_{l=1}^{L} \frac{1}{W^{k}(l)} \cdot I_{d}(l) \right) * B_{mean}^{k}$$

with I_d(l) a decorrelated input signal among said input signals, the other values being those defined above. Thus, the energy differences between sums of correlated signals and sums of decorrelated signals can be taken into account. In a particular embodiment, the decorrelation is applied prior to filtering. It is thus possible to dispense with energy compensation steps during the filtering.

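In one embodiment, at least one output signal is obtained by applying a formula of the type:

$$O^{k} = \sum_{l=1}^{L} \left( I(l) * A^{k}(l) \right) + z^{-iDD} \cdot G(I(l)) \cdot \left( \sum_{l=1}^{L} \frac{1}{W^{k}(l)} \cdot I(l) \right) * B_{mean}^{k}$$

with G(I(l)) the determined energy compensation gain, the other values being those defined above. In a variant, G does not depend on I(l).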

In one embodiment, the weighting factor is given by applying a formula of the type:

$$W^{k}(l) = \sqrt{\frac{E_{B_{mean}^{k}}}{E_{B^{k}(l)}}}$$

with k the index relating to an output signal, l ∈ [1, L] the index relating to an input signal among the input signals, L the number of input signals, E_{B_mean^k} the energy of a room effect transfer function among the second room effect transfer functions, and E_{B^k(l)} an energy relating to the normalization gain.

The invention also relates to a computer program comprising instructions for implementing the method described above.

The invention can be implemented by a sound spatialization device comprising at least one filter, with summation, applied to at least two input signals (I(1), I(2), ..., I(L)), the filter using: at least one first room effect transfer function (A^k(1), A^k(2), ..., A^k(L)), this first transfer function being specific to each input signal; and at least one second room effect transfer function (B_mean^k), this second transfer function being common to all the input signals. The device comprises weighting modules for weighting at least one input signal with a weighting factor, said weighting factor being specific to each of the input signals.

Such a device can take the physical form of, for example, a processor and possibly a working memory, typically in a communication terminal. The invention can also be implemented in a module for decoding sound signals serving as input signals, comprising the spatialization device described above.

Other advantages and characteristics of the invention will become apparent on reading the following detailed description of exemplary embodiments of the invention and on examining the drawings, in which:

- FIG. 1 illustrates a spatialization method of the prior art,
- FIG. 2 schematically illustrates the steps of a method within the meaning of the invention, in an exemplary embodiment,
- FIG. 3 represents a binaural room impulse response (BRIR),
- FIG. 4 schematically illustrates the steps of a method within the meaning of the invention, in an exemplary embodiment,
- FIG. 5 schematically illustrates the steps of a method within the meaning of the invention, in an exemplary embodiment,
- FIG. 6 schematically shows a device comprising means for implementing the method within the meaning of the invention.

Reference is made to FIG. 6, which shows a possible context for implementing the present invention in a connected terminal TER (for example a telephone, a smartphone, a connected tablet, a connected computer, or other device). Such a device TER comprises means for receiving (typically an antenna) compressed, encoded audio signals Xc, and a decoding device DECOD delivering decoded signals X ready to be processed by a spatialization device before the audio signals are rendered (for example binaurally on a headset CAS). Of course, in some cases it may be advantageous to keep the signals partially decoded (for example in the sub-band domain) if the spatialization processing is carried out in the same domain (for example frequency processing in the sub-band domain). Still with reference to FIG. 6, the spatialization device is a combination of elements: hardware, typically comprising one or more circuits CIR cooperating with a working memory MEM and a processor PROC; and software, whose general algorithm is illustrated by the example flowcharts in the figures.

Here, the cooperation between the hardware and software elements produces a technical effect, in particular a saving in spatialization complexity for substantially the same audio rendering (the same sensation for a listener), as will be seen below.

Reference is now made to FIG. 2 to describe processing within the meaning of the invention, implemented by computer means.

In a first step S21, the data are prepared. This preparation is optional: the signals can be processed according to steps S22 and following without this pre-processing.

In particular, this preparation consists in truncating each BRIR to ignore the inaudible samples at the beginning and at the end of the impulse response.

For the truncation at the start of the impulse response (TRONC S), in step S211, this preparation consists in determining a start time of direct sound waves and can be implemented by the following steps:

- A cumulative sum of the energies of each of the filters BRIR(l) is calculated. Typically, this energy is computed as the sum of the squared amplitudes of samples 1 to j, with j in [1; J] and J the number of samples of a BRIR filter.
- The energy value valMax of the maximum-energy filter (among the filters for the left ear and the right ear) is calculated.
- For each of the loudspeakers l, the index at which the energy of each of the filters BRIR(l) exceeds a certain threshold in dB relative to valMax (for example valMax - 50 dB) is calculated. The truncation index iT retained for all BRIRs is the minimum of all these indices and is taken as the start time of direct sound waves.

The index iT obtained therefore corresponds to the number of samples to be ignored for each of the BRIRs. Abrupt truncation at the start of an impulse response with a rectangular window can lead to audible artifacts if it is applied to a portion containing too much energy. It may therefore be preferable to apply a suitable fade-in window. However, if the threshold has been chosen with care, this windowing becomes unnecessary because its effect is inaudible (only inaudible signal is cut).
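The start-truncation steps above can be sketched as follows. This is a minimal illustration in plain Python, assuming zero-based sample indices and non-zero filters. The function name `truncation_index` and the toy impulse responses are hypothetical.

```python
def truncation_index(brirs, threshold_db=50.0):
    """Return iT, the smallest index at which the cumulative energy of
    any BRIR exceeds (valMax - threshold_db), where valMax is the
    largest total energy among all the filters."""
    cumulative = []
    for h in brirs:
        acc, curve = 0.0, []
        for sample in h:
            acc += sample * sample  # running sum of squared amplitudes
            curve.append(acc)
        cumulative.append(curve)
    val_max = max(curve[-1] for curve in cumulative)
    threshold = val_max * 10 ** (-threshold_db / 10)  # dB to energy ratio
    indices = []
    for curve in cumulative:
        for j, e in enumerate(curve):
            if e > threshold:
                indices.append(j)
                break
    return min(indices)


# Toy filters: two leading zeros and one negligible sample, then the onset.
brirs = [[0.0, 0.0, 1e-6, 1.0, 0.5], [0.0, 0.0, 0.0, 0.8, 0.4]]
iT = truncation_index(brirs)
trimmed = [h[iT:] for h in brirs]  # same iT for all BRIRs keeps them aligned
```

Using one index for every filter preserves the relative timing between BRIRs, which is why the minimum over all filters is retained.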

Because the BRIRs are kept synchronized, a constant delay can be applied to all of them for simplicity of implementation, even though a complexity optimization would be possible.

The truncation of each BRIR to ignore the inaudible samples at the end of the impulse response (TRONC E), in step S212, can be performed with steps similar to those described above, adapted to the end of the impulse response. Abrupt truncation at the end of an impulse response with a rectangular window may lead to audible artifacts on impulsive signals, for which the reverberation tail may be audible. Thus, in one embodiment, a suitable fade-out window is applied.
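The text does not specify the shape of the fade-out window. The sketch below assumes a raised-cosine ramp, which is only one common choice. The function name `truncate_with_fade_out` and its parameters are hypothetical.

```python
import math


def truncate_with_fade_out(h, end_index, fade_len):
    """Keep h[:end_index] and taper its last fade_len samples to zero."""
    kept = list(h[:end_index])
    fade_len = min(fade_len, len(kept))
    start = len(kept) - fade_len
    for n in range(fade_len):
        # Raised-cosine gain decreasing to 0 on the last kept sample.
        gain = 0.5 * (1.0 + math.cos(math.pi * (n + 1) / fade_len))
        kept[start + n] *= gain
    return kept


tail = truncate_with_fade_out([1.0] * 8, end_index=6, fade_len=4)
```

The samples before the ramp are left untouched, so only the end of the reverberation tail is affected.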

In step S22, a synchronized isolation (ISOL A/B) is performed. This synchronized isolation consists of separating, for each BRIR, the "direct sound" and "first reflections" part (Direct, denoted A) from the "diffuse sound" part (Diffus, denoted B). Indeed, the processing to be performed on the "diffuse sound" part may advantageously differ from that performed on the "direct sound" part, since it is preferable to have better processing quality on the "direct sound" part than on the "diffuse sound" part. This makes it possible to optimize the quality/complexity ratio.

In particular, to achieve synchronized isolation, a single sample index "iDD" common to all BRIRs (hence the term "synchronized") is determined, from which the remainder of the impulse response is considered to correspond to a diffuse field. The impulse responses BRIR(l) are therefore partitioned into two parts, A(l) and B(l), whose concatenation corresponds to BRIR(l).

FIG. 3 shows the partitioning index iDD at sample 2000. The part to the left of this index corresponds to part A, and the part to the right corresponds to part B. In one embodiment, these two parts are isolated, without windowing, in order to undergo different processing. In a variant, a windowing between the parts A(l) and B(l) is applied.

The iDD index may be specific to the room for which the BRIRs were determined. The calculation of this index may therefore depend on the spectral envelope, on the correlation of the BRIRs or on the echogram of these BRIRs. For example, iDD can be determined by a formula of the type iDD = f(V_room), with V_room the volume of the measurement room.

In one embodiment, iDD is a fixed value, typically 2000. In a variant, iDD varies, advantageously dynamically, depending on the environment in which the input signals are captured.

The output signal for the left (g) and right (d) ears, denoted O^{g/d}, is written as follows:

$$O^{g/d} = \sum_{l=1}^{L} I(l) * A^{g/d}(l) + z^{-iDD} \cdot \sum_{l=1}^{L} I(l) * B^{g/d}(l)$$

where z^{-iDD} corresponds to a delay of iDD samples.

This delay is applied to the signals by storing the values calculated for Σ_{l=1}^{L} I(l) * B^{g/d}(l) in a temporary memory (for example a buffer) and retrieving them at the desired moment.
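The following sketch illustrates the partition at iDD and the delayed recombination for one ear, and checks on toy data that filtering with A and B separately, then delaying the B branch by iDD samples, gives the same result as filtering with the full BRIRs. The helper names (`convolve`, `add_into`, `render_split`) and all values are hypothetical.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def add_into(acc, y, offset=0):
    """In place: acc[n + offset] += y[n], growing acc when needed."""
    needed = len(y) + offset
    if len(acc) < needed:
        acc.extend([0.0] * (needed - len(acc)))
    for n, v in enumerate(y):
        acc[n + offset] += v


def render_split(inputs, brirs, iDD):
    """Sum of I(l) * A(l) plus the sum of I(l) * B(l) delayed by iDD."""
    direct, diffuse = [], []
    for signal, h in zip(inputs, brirs):
        add_into(direct, convolve(signal, h[:iDD]))   # part A (Direct)
        add_into(diffuse, convolve(signal, h[iDD:]))  # part B (Diffus)
    out = list(direct)
    add_into(out, diffuse, offset=iDD)  # buffered delay of iDD samples
    return out


inputs = [[1.0, 2.0, 0.0], [0.5, -1.0, 0.25]]
brirs = [[1.0, 0.5, 0.25, 0.125], [0.8, -0.4, 0.2, 0.1]]
split_out = render_split(inputs, brirs, iDD=2)

# Reference: conventional filtering with the complete BRIRs.
reference = []
for signal, h in zip(inputs, brirs):
    add_into(reference, convolve(signal, h))
```

At this stage nothing has been approximated. The split only prepares the B branch so that it can later be replaced by a cheaper common filter.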

In one embodiment, the sample indices selected for A and B may also take frame lengths into account when the processing is integrated into an audio encoder. Indeed, typical frame sizes of 1024 samples can lead to a choice such that A has 1024 samples and B has 2048, while making sure that B is a diffuse field zone for all BRIRs. In particular, it may be advantageous for the size of B to be a multiple of the size of A because, if the filtering is implemented by FFT blocks, the FFT computed for A can be reused for B.

A diffuse field is characterized by the fact that it is statistically identical at all points of the room. Its frequency response therefore varies little with the loudspeaker to be simulated. The present invention exploits this feature to replace all the Diffus filters B(l) of all the BRIRs with one single "mean" filter B_mean, in order to greatly reduce the complexity due to the multiple convolutions. For this purpose, the diffuse field part B can be modified in step S23B, again with reference to FIG. 2.

In step S23B1, the mean filter B_mean is calculated. First, the complete system is only very rarely calibrated ideally, so a weighting gain can be applied and carried over to the input signal, so that a single convolution per ear is performed for the diffuse field part. The BRIRs are therefore decomposed into unit-energy filters, and the normalization gain is carried over to the input signal:

$$O^{g/d} = \sum_{l=1}^{L} I(l) * A^{g/d}(l) + z^{-iDD} \cdot \sum_{l=1}^{L} \sqrt{E_{B^{g/d}(l)}} \cdot I(l) * B_{norm}^{g/d}(l)$$

with

$$B_{norm}^{g/d}(l) = \frac{B^{g/d}(l)}{\sqrt{E_{B^{g/d}(l)}}}$$

where E_{B^{g/d}(l)} represents the energy of B^{g/d}(l).

B_norm^{g/d}(l) is then approximated by a single mean filter B_mean^{g/d}, which no longer depends on the loudspeaker l and which can also be energy-normalized:

$$O^{g/d} \approx \sum_{l=1}^{L} I(l) * A^{g/d}(l) + z^{-iDD} \cdot \sum_{l=1}^{L} \frac{\sqrt{E_{B^{g/d}(l)}}}{\sqrt{E_{B_{mean}^{g/d}}}} \cdot I(l) * B_{mean}^{g/d}$$

with

$$B_{mean}^{g/d} = \frac{1}{L} \sum_{l=1}^{L} B_{norm}^{g/d}(l)$$

In one embodiment, this mean filter can be obtained by averaging time samples. In a variant, it can be obtained by any other type of averaging, such as averaging power spectral densities.

In one embodiment, the energy E_{B_mean^{g/d}} of the mean filter can be measured directly from the constructed filter B_mean^{g/d}. Alternatively, it can be estimated under the assumption that the filters B_norm^{g/d}(l) are decorrelated. In this case, since unit-energy signals are summed, we have:

$$E_{B_{mean}^{g/d}} = \text{energy}\left( \frac{1}{L} \sum_{l=1}^{L} B_{norm}^{g/d}(l) \right) = \frac{1}{L^{2}} \cdot L = \frac{1}{L}$$

The energy can be calculated over all the samples corresponding to the diffuse field part.

In step S23B2, the weighting factor W^{g/d}(l) is calculated. A single weighting factor to be applied to the input signal is calculated, taking into account the normalizations of the Diffus filters and of the mean filter:

$$W^{g/d}(l) = \sqrt{\frac{E_{B_{mean}^{g/d}}}{E_{B^{g/d}(l)}}}$$

Since the mean filter is constant, it can be taken out of the sum:

$$O^{g/d} \approx \sum_{l=1}^{L} I(l) * A^{g/d}(l) + z^{-iDD} \cdot \left( \sum_{l=1}^{L} \frac{1}{W^{g/d}(l)} \cdot I(l) \right) * B_{mean}^{g/d}$$

Thus, the L convolutions with the diffuse field part are replaced by a single convolution with a mean filter, applied to a weighted sum of the input signals.
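A minimal sketch of steps S23B1 and S23B2 for one ear follows. It assumes that the mean filter is the time-sample average of the unit-energy Diffus filters and that the weighting factor is W(l) = sqrt(E_Bmean / E_B(l)), applied to the inputs as 1/W(l). This is one reading consistent with the normalizations described above, not a confirmed transcription of the original formulas. All function names and values are hypothetical.

```python
import math


def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def energy(h):
    return sum(v * v for v in h)


def mean_filter_and_weights(diffuse_filters):
    """Return (B_mean, [W(l)]) from the Diffus filters B(l) of one ear."""
    L = len(diffuse_filters)
    energies = [energy(b) for b in diffuse_filters]
    normalized = [[v / math.sqrt(e) for v in b]           # B_norm(l)
                  for b, e in zip(diffuse_filters, energies)]
    b_mean = [sum(col) / L for col in zip(*normalized)]   # time average
    e_mean = energy(b_mean)
    weights = [math.sqrt(e_mean / e) for e in energies]   # assumed W(l)
    return b_mean, weights


def diffuse_branch(inputs, b_mean, weights):
    """Single convolution: (sum over l of I(l) / W(l)) * B_mean."""
    mixed = [sum(s[n] / w for s, w in zip(inputs, weights))
             for n in range(len(inputs[0]))]
    return convolve(mixed, b_mean)


# Toy case where the approximation is exact: both Diffus filters share
# one shape and differ only in level (gains 2.0 and 0.5).
diffuse_filters = [[2.0, -1.0, 0.5], [0.5, -0.25, 0.125]]
inputs = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
b_mean, weights = mean_filter_and_weights(diffuse_filters)
approx = diffuse_branch(inputs, b_mean, weights)

# Conventional result: one convolution per loudspeaker.
exact = [0.0] * len(approx)
for s, b in zip(inputs, diffuse_filters):
    for n, v in enumerate(convolve(s, b)):
        exact[n] += v
```

With measured BRIRs the Diffus filters do not share a single shape, so the two results differ. The per-input factors then only preserve the level differences between loudspeakers.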

In step S23B3, a gain G correcting the gain of the mean filter B_mean can be calculated. Indeed, in the case of the convolution between the input signals and the non-approximated filters, whatever the correlation between the input signals, filtering by the decorrelated filters B^{g/d}(l) yields signals to be summed that are also decorrelated. Conversely, in the case of the convolution between the input signals and the approximated mean filter, the energy of the signal resulting from the summation of the filtered signals depends on the correlation existing between the input signals.

For example:

- if all the input signals I(l) are identical and of unit energy, and the filters B(l) are all decorrelated (since they are diffuse fields) and of unit energy, we have:

$$\text{energy}\left( \sum_{l=1}^{L} I(l) * B(l) \right) = \sum_{l=1}^{L} \text{energy}\left( I(l) * B(l) \right) = L$$

- if all the input signals I(l) are decorrelated and of unit energy, and the filters B(l) are all of unit energy, we have:

$$\text{energy}\left( \sum_{l=1}^{L} I(l) * B(l) \right) = \sum_{l=1}^{L} \text{energy}\left( I(l) * B(l) \right) = L$$

because the energies of decorrelated signals add. This case is equivalent to the previous one in that the signals resulting from the filtering are all decorrelated, thanks to the filters in one case and to the input signals in the other.

- if all the input signals I(l) are identical and of unit energy, and the filters B(l) are all of unit energy but are replaced by identical filters B, we have:

$$\text{energy}\left( \sum_{l=1}^{L} I(l) * B \right) = L^{2} \cdot \text{energy}\left( I(1) * B \right) = L^{2}$$

because identical signals add in amplitude, so that their energy grows quadratically.

Thus:

- if two loudspeakers are active simultaneously and fed by decorrelated signals, no gain is introduced by applying steps S23B1 and S23B2 compared with the conventional method;
- if two loudspeakers are active simultaneously and fed by identical signals, a gain of 10·log10(L²/L) = 10·log10(2²/2) = 3.01 dB is introduced by applying steps S23B1 and S23B2 compared with the conventional method;
- if three loudspeakers are active simultaneously and fed by identical signals, a gain of 10·log10(L²/L) = 10·log10(3²/3) = 4.77 dB is introduced by applying steps S23B1 and S23B2 compared with the conventional method.
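The arithmetic behind these figures can be checked with a short sketch. The two-sample signals are toy stand-ins for unit-energy decorrelated (here orthogonal) and identical signals, and the function name `excess_gain_db` is hypothetical.

```python
import math


def energy(x):
    return sum(v * v for v in x)


def excess_gain_db(n_identical):
    """10 . log10(L^2 / L) for L loudspeakers fed the same signal."""
    return 10 * math.log10(n_identical ** 2 / n_identical)


# Two unit-energy signals: orthogonal (decorrelated) versus identical.
a, b = [1.0, 0.0], [0.0, 1.0]
e_decorrelated = energy([u + v for u, v in zip(a, b)])  # energies add
e_identical = energy([u + v for u, v in zip(a, a)])     # amplitudes add

gain_2 = excess_gain_db(2)
gain_3 = excess_gain_db(3)
print(round(gain_2, 2), round(gain_3, 2))  # 3.01 4.77
```

The ratio e_identical / e_decorrelated is 2 for two loudspeakers, which is the 3.01 dB excess that the compensation gain G is meant to remove.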

The cases mentioned above correspond to the extreme cases of identical or decorrelated signals. These cases are nevertheless realistic: a source positioned midway between two loudspeakers, virtual or real, provides an identical signal to both loudspeakers (for example with a VBAP-type technique, for "Vector Base Amplitude Panning"). In the case of positioning in a 3D system, three loudspeakers can receive the same signal at the same level.

Thus, compensation can be applied to respect the energy of binauralized signals.

Ideally, this compensation gain G will be determined as a function of the input signals (i.e. G(I(l))) and will be applied to the sum, over l = 1 to L, of the weighted input signals.

The gain G(I(l)) can be estimated by calculating the correlation between each of the signals. It can also be estimated by comparing the energies of the signals before and after summation. In this case, the gain G may vary dynamically over time, depending for example on the correlations between the input signals, which themselves vary over time.
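The energy-comparison estimate can be sketched as follows. This is only one possible reading, under the assumption that the target energy is the sum of the individual energies (the decorrelated reference case) and that the estimate is made per block of samples; the function name `estimate_compensation_gain` and the `eps` guard are hypothetical.

```python
import math

def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)

def estimate_compensation_gain(weighted_inputs, eps=1e-12):
    """Linear gain G such that energy(G * sum of inputs) equals the sum of the
    individual input energies. Assumption: the decorrelated case is the target."""
    length = len(weighted_inputs[0])
    summed = [sum(s[n] for s in weighted_inputs) for n in range(length)]
    energy_before = sum(energy(s) for s in weighted_inputs)  # before summation
    energy_after = energy(summed)                            # after summation
    if energy_after < eps:
        return 1.0  # silent block: leave the level untouched
    return math.sqrt(energy_before / energy_after)

# Identical inputs need attenuation; orthogonal inputs need none.
g_identical = estimate_compensation_gain([[1.0, 0.0], [1.0, 0.0]])
g_orthogonal = estimate_compensation_gain([[1.0, 0.0], [0.0, 1.0]])
```

Calling the function on successive blocks yields a gain that follows the changing correlation between the inputs, at the cost of one energy estimate per block.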

In a simplified embodiment, it is possible to set a constant gain, for example G = −3 dB = 10^(−3/20). This avoids a correlation estimate, which can be expensive. The constant gain G can then be applied offline to the weighting factors (which then incorporate G), or to the filter B_mean, which avoids applying an additional gain on the fly.
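Because the weighted sum is linear, folding a constant gain into the weights gives the same result as applying it after summation. A minimal sketch, assuming the weights act as plain multiplicative factors; the weight and sample values are hypothetical and chosen only for illustration.

```python
def db_to_linear(gain_db):
    """Convert an amplitude gain in dB to a linear factor."""
    return 10 ** (gain_db / 20)

def weighted_sum(inputs, weights):
    """Sample-wise sum of the inputs, each scaled by its own weight."""
    length = len(inputs[0])
    return [sum(w * s[n] for w, s in zip(weights, inputs)) for n in range(length)]

G = db_to_linear(-3.0)  # constant compensation gain, about 0.708

inputs = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]  # hypothetical input samples
weights = [0.8, 1.2]                          # hypothetical weighting factors

# Gain applied on the fly, after the weighted summation.
on_the_fly = [G * x for x in weighted_sum(inputs, weights)]

# Gain folded into the weights once, offline.
folded_weights = [G * w for w in weights]
offline = weighted_sum(inputs, folded_weights)
```

Both paths produce the same samples up to rounding, so the offline variant saves one multiplication per output sample.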

Once the transfer functions A and B have been isolated and the filters B_mean^{g/d} (and optionally the weights W^{g/d}(l) and the gain G) have been calculated, these transfer functions and filters are applied to the input signals.

In a first embodiment, described with reference to FIG. 4, the processing of the multichannel signal by applying the Direct (A) and Diffuse (B) filters for each of the ears is carried out as follows:

* In steps S4A1 to S4AL, efficient filtering (for example direct FFT-based convolution) by the Direct filters (A) is applied to the multichannel input signal, as described in the state of the art. A signal Ô_A^{g/d} is obtained.
* Depending on the relationships between the input signals, in particular their correlation, the gain of the average filter B_mean^{g/d} can optionally be corrected in step S4B11 by applying the gain G to the signal obtained after summation of the previously weighted input signals (steps M4B1 to M4BL).
* In step S4B1, efficient filtering by the mean diffuse filter B_mean is applied to the multichannel signal. This step takes place after the summation of the previously weighted input signals (steps M4B1 to M4BL). A signal Ô_B^{g/d} is obtained.
* In step S4B2, a delay iDD is applied to the signal Ô_B^{g/d} in order to compensate for the delay introduced when isolating B.
* The signals Ô_A^{g/d} and Ô_B^{g/d} are summed.
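The signal flow of this first embodiment, for one ear, can be sketched as follows. This is a simplified illustration rather than the patented implementation: it uses direct time-domain convolution instead of FFT-based filtering, treats the weights as plain multiplicative factors, and omits the truncation delay. The names `convolve`, `add` and `render_ear` are hypothetical.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def add(a, b):
    """Sample-wise sum of two lists, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [u + v for u, v in zip(a, b)]

def render_ear(inputs, direct_filters, weights, b_mean, delay_idd, gain=1.0):
    """Output for one ear: per-input direct part plus shared diffuse part."""
    # Steps S4A1..S4AL: each input is filtered by its own direct filter A(l).
    direct = []
    for x, a in zip(inputs, direct_filters):
        direct = add(direct, convolve(x, a))
    # Steps M4B1..M4BL: weight each input, then sum the weighted inputs.
    mixed = []
    for x, w in zip(inputs, weights):
        mixed = add(mixed, [w * s for s in x])
    # Step S4B11 (optional): compensation gain G on the summed signal.
    mixed = [gain * s for s in mixed]
    # Step S4B1: a single convolution with the mean diffuse filter.
    diffuse = convolve(mixed, b_mean)
    # Step S4B2: delay the diffuse part by iDD samples.
    diffuse = [0.0] * delay_idd + diffuse
    # Final summation of the direct and diffuse parts.
    return add(direct, diffuse)

out = render_ear(
    inputs=[[1.0, 0.0], [0.0, 1.0]],
    direct_filters=[[1.0], [1.0]],
    weights=[1.0, 1.0],
    b_mean=[1.0],
    delay_idd=1,
)
```

The diffuse branch costs one convolution regardless of the number of inputs L, which is where the complexity saving comes from.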

If a truncation eliminating inaudible samples at the beginning of the impulse responses has been performed, then in step S41 a delay iT corresponding to the deleted inaudible samples is applied to the input signal.

In a variant, with reference to FIG. 5, the signals are not only calculated for the left and right ears (indices g and d above) but for k playback devices (typically loudspeakers).

In a second embodiment, the gain G is applied prior to the summing of the input signals, that is to say during the weighting steps (steps M4B1 to M4BL).

In a third embodiment, a decorrelation is applied to the input signals. Thus, the signals are decorrelated after convolution by the B _mean filter regardless of the original correlations between input signals. An efficient implementation of decorrelation (for example using a loopback network) can be used to avoid the use of expensive decorrelating filters.

Thus, realistically assuming that BRIRs 48,000 samples long can be truncated to the span between sample 150 and sample 3222 by the technique described in step S21, and then split into two parts by the technique described in step S22 (a direct field A of 1024 samples and a diffuse field B of 2048 samples), the binauralization complexity can be approximated by:

$$C_{inv} = C_{invA} + C_{invB} = (L + 2) \cdot 6 \log_2(2 \cdot NA) + (L + 2) \cdot 6 \log_2(2 \cdot NB)$$

with NA and NB the sizes, in samples, of A and B.

Thus for nBlocks = 10, Fs = 48000, L = 22, NA = 1024 and NB = 2048, the complexity per multichannel signal sample for an FFT-based convolution is C_inv = 3312 multiplications-additions.

However, this result should logically be compared with a simple solution implementing only truncation, i.e. for nBlocks = 10, Fs = 3072, L = 22:

$$C_{trunc} = (L + 2) \cdot nBlocks \cdot 6 \log_2(2 \cdot Fs / nBlocks) = 13339$$

There is therefore a complexity factor of 19049/3312 = 5.75 between the state of the art and the present invention, and a factor of 13339/3312 = 4 between the state of the art benefiting from truncation and the present invention.
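The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative, and the state-of-the-art value 19049 is taken from the text rather than recomputed.

```python
import math

L = 22          # number of input channels
NA = 1024       # direct field A, in samples
NB = 2048       # diffuse field B, in samples
N_BLOCKS = 10
C_STATE_OF_ART = 19049  # quoted in the text, not recomputed here

def c_inv(num_inputs, na, nb):
    """Complexity of the proposed method: direct part plus diffuse part."""
    c_a = (num_inputs + 2) * 6 * math.log2(2 * na)
    c_b = (num_inputs + 2) * 6 * math.log2(2 * nb)
    return c_a + c_b

def c_trunc(num_inputs, n_blocks, fs):
    """Complexity of block-wise FFT convolution on truncated responses."""
    return (num_inputs + 2) * n_blocks * 6 * math.log2(2 * fs / n_blocks)

cost_inv = c_inv(L, NA, NB)
cost_trunc = c_trunc(L, N_BLOCKS, 3072)

ratio_state_of_art = C_STATE_OF_ART / cost_inv
ratio_truncated = cost_trunc / cost_inv
```

With these inputs the two ratios come out at roughly 5.75 and 4, matching the factors stated above.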

If the size of B is a multiple of the size of A, and if the filtering is implemented in FFT blocks, then the FFT computed for A can be reused for B. We therefore need L FFTs on NA points, which serve for the filtering both by A and by B, two inverse FFTs on NA points to obtain the binaural time signal, and the multiplication of the spectra in the frequency domain.

In this case, the complexity can be approximated (the additions are neglected; (L + 1) corresponds to the multiplication of the spectra, L for A and 1 for B) by:

$$C_{inv2} = (L + 2) \cdot 6 \log_2(2 \cdot NA) + (L + 1) = 1607$$

With this approach, a further factor of 2 is gained, and hence factors of about 12 and 8 compared with the untruncated and truncated state of the art, respectively.
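The shared-FFT estimate can be checked in the same way. A minimal sketch: the reference values 3312, 19049 and 13339 are taken from the text, and the function name `c_inv2` is illustrative.

```python
import math

def c_inv2(num_inputs, na):
    """Complexity when the FFTs computed for A are reused for B."""
    ffts = (num_inputs + 2) * 6 * math.log2(2 * na)  # L forward + 2 inverse FFTs
    spectral_products = num_inputs + 1               # L for A, 1 for B
    return ffts + spectral_products

cost_inv2 = c_inv2(22, 1024)

gain_vs_inv = 3312 / cost_inv2            # versus the non-shared variant
gain_vs_untruncated = 19049 / cost_inv2   # versus the untruncated state of the art
gain_vs_truncated = 13339 / cost_inv2     # versus the truncated state of the art
```

The three ratios round to 2, 12 and 8 respectively.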

The invention can find a direct application in the MPEG-H 3D Audio standard.

Of course, the present invention is not limited to the embodiment described above; it extends to other variants.

For example, an embodiment has been described above in which the Direct signal A is not approximated by an average filter. Of course, it is possible to use an average filter of A to perform the convolutions (steps S4A1 to S4AL) with the signals coming from the loudspeakers.

An embodiment has been described above based on the processing of multichannel content generated for L loudspeakers. Of course, the multichannel content can be generated by any type of audio source such as a voice, a musical instrument, any noise, etc.

An embodiment has been described above based on formulas applying in a certain field of computation (for example transformed domain). Of course, the present invention is not limited to these formulas and these formulas can be modified to be applicable in other computation domains (for example time domain, frequency domain, time-frequency domain, etc.).

An embodiment has been described above based on determined BRIR values in a room. Of course, the present invention can be implemented for any type of external environment (eg concert hall, open air, etc.).

An embodiment has been described above based on the application of two transfer functions. Of course, the present invention can be implemented with more than two transfer functions. For example, one can isolate in synchronism a part relating to the sounds emitted directly, a part relating to the first reflections and a part relating to diffuse sounds.

Claims

1. Sound spatialization method, in which at least one filtering, with summation, is applied to at least two input signals (I(1), I(2), ..., I(L)), the filtering comprising:
- the application of at least one first room effect transfer function (A^k(1), A^k(2), ..., A^k(L)), said first transfer function being specific to each input signal, and
- the application of at least one second room effect transfer function (B_mean^k), said second transfer function being common to all input signals,
characterized in that the method comprises a step of weighting at least one input signal by a weighting weight (W^k(l)), said weighting weight being specific to each of the input signals.

2. Method according to claim 1, characterized in that said first and second transfer functions are respectively representative of:
- direct sound propagations and first sound reflections of said propagations; and
- a diffuse sound field present after said first reflections,
and in that it comprises:
- the application of first transfer functions respectively specific to the input signals, and
- the application of a second transfer function, identical for all input signals, and resulting from an overall approximation of a diffuse sound field effect.

3. Method according to claim 2, characterized in that it comprises a preliminary step of construction of said first and second transfer functions from impulse responses incorporating a room effect, said preliminary step comprising, for the construction of a first transfer function, the operations:
- determining a start time of presence of direct sound waves,
- determining a start time of presence of said diffuse sound field after the first reflections, and
- selecting, in an impulse response, a portion of the response that extends temporally from said start time of presence of direct sound waves to said start time of presence of the diffuse field, said selected response portion corresponding to said first transfer function.

4. Method according to claim 3, characterized in that the second transfer function is constructed from a set of impulse response portions starting temporally after said start time of presence of the diffuse field.

5. Method according to one of claims 3 or 4, wherein said second transfer function is given by applying a formula of the type:

$$B_{mean}^{k} = \frac{1}{L} \sum_{l=1}^{L} B_{norm}^{k}(l)$$

with k the index relating to an output signal, l ∈ [1; L] the index relating to an input signal, L the number of input signals, and B_norm^k(l) a normalized transfer function obtained from a set of impulse response portions starting temporally after said start time of presence of the diffuse field.

6. Method according to one of claims 3 to 5, characterized in that said filtering comprises the application of at least one compensation delay corresponding to a time difference between said instant of start of direct sound waves and said instant of beginning of diffuse field presence.

7. Method according to claim 6, characterized in that said first and second room effect transfer functions are applied in parallel to said input signals and in that said at least one compensation delay is applied to the input signals filtered by said second transfer functions.

8. Method according to claim 1, characterized in that an energy compensation gain (G) is applied to the weighting weights.

9. Method according to claim 1, characterized in that at least one output signal of said method is given by applying a formula of the type:

with k the index relating to an output signal,

O^k an output signal,

l ∈ [1; L] the index relating to an input signal among said input signals,

L the number of input signals,

I(l) an input signal of said input signals,

A^k(l) a room effect transfer function among said first room effect transfer functions,

B_mean^k a room effect transfer function among said second room effect transfer functions,

W^k(l) a weighting weight among said weighting weights,

z^{-iDD} corresponds to the application of said compensation delay,

where · is the multiplication, and where * is the convolution operator.

10. Method according to claim 1, characterized in that it comprises a step of decorrelation of the input signals, prior to the application of the second transfer functions, and in that at least one output signal of said method is given by applying a formula of the type:

with k the index relating to an output signal,

O^k an output signal,

l ∈ [1; L] the index relating to an input signal among said input signals,

L the number of input signals,

I(l) an input signal of said input signals,

I_d(l) a decorrelated input signal of said input signals,

A^k(l) a room effect transfer function among said first room effect transfer functions,

B_mean^k a room effect transfer function among said second room effect transfer functions,

W^k(l) a weighting weight among said weighting weights,

z^{-iDD} corresponds to the application of said compensation delay,

where · is the multiplication, and where * is the convolution operator.

11. Method according to claim 1, characterized in that it comprises a step of determining an energy compensation gain as a function of the input signals and in that at least one output signal is given by applying a formula of the type:

with k the index relating to an output signal,

O^k an output signal,

l ∈ [1; L] the index relating to an input signal among said input signals,

L the number of input signals,

I(l) an input signal of said input signals,

G(I(l)) said determined energy compensation gain,

A^k(l) a room effect transfer function among said first room effect transfer functions,

B_mean^k a room effect transfer function among said second room effect transfer functions,

W^k(l) a weighting weight among said weighting weights,

z^{-iDD} corresponds to the application of said compensation delay,

where · is the multiplication, and where * is the convolution operator.

12. Method according to one of claims 1 to 11, characterized in that said weighting weight is given by applying a formula of the type:

with k the index relating to an output signal,

l ∈ [1; L] the index relating to an input signal among said input signals,

L the number of input signals,

and with the energy of a room effect transfer function among said second room effect transfer functions, and an energy relating to the normalization gain.

13. Computer program comprising instructions for implementing the method according to one of claims 1 to 12, when these instructions are executed by a processor.

14. Sound spatialization device comprising at least one filter with summation applied to at least two input signals (I(1), I(2), ..., I(L)), the filter using:
- at least one first room effect transfer function (A^k(1), A^k(2), ..., A^k(L)), said first transfer function being specific to each input signal, and
- at least one second room effect transfer function (B_mean^k), said second transfer function being common to all the input signals,
characterized in that it comprises weighting modules (M4B1, M4B2, ..., M4BL) for weighting at least one input signal by a weighting weight (W^k(l)), said weighting weight being specific to each of the input signals.

15. Sound signal decoding module comprising a spatialization device according to claim 14 for spatializing said sound signals as input signals.