EP3635718A1

EP3635718A1 - Processing of sound data for separating sound sources in a multichannel signal

Info

Publication number: EP3635718A1
Application number: EP18737650.4A
Authority: EP
Inventors: Mathieu BAQU; Alexandre Guerin
Original assignee: Orange SA
Current assignee: Orange SA
Priority date: 2017-06-09
Filing date: 2018-05-24
Publication date: 2020-04-15
Anticipated expiration: 2038-05-24
Also published as: US11081126B2; CN110709929A; CN110709929B; EP3635718B1; FR3067511A1; WO2018224739A1; US20200152222A1

Abstract

The present invention pertains to a method for processing sound data for separating N sound sources of a multichannel sound signal sensed in a real medium. The method comprises the steps of applying (E310) a processing for separating sources to the sensed multichannel signal and obtaining a separation matrix and a set of M sound components, with M≥N, of calculating (E320) a set of so-called bi-variate first descriptors representative of statistical relations between the components of the pairs of the set obtained of M components, of calculating (E320) a set of so-called uni-variate second descriptors representative of characteristics of encoding of the components of the set obtained of M components and of classifying (E340) the components of the set of M components, according to two classes of components, a first class of N so-called direct components corresponding to the N direct sound sources and a second class of M-N so-called reverberated components, by a calculation (E330) of probability of membership in one of the two classes, dependent on the sets of first and second descriptors. The invention also pertains to a processing device implementing the method such as described.

Description

Processing of sound data for separating sound sources in a multichannel signal

The present invention relates to the field of audio or acoustic signal processing and more particularly to the processing of real multichannel sound contents to separate sound sources.

The separation of sources in a multichannel sound signal allows multiple applications. It can for example be used:

o For entertainment (karaoke: removing the voice),

o For music (mixing separated sources in multichannel content),

o For telecommunications (voice enhancement, denoising),

o For home automation (voice control),

o For multi-channel audio coding,

o For source localization and mapping in imaging.

In a space E in which a number N of sources emit signals s_i, blind source separation consists in counting and extracting the N sources from a number M of observations delivered by sensors distributed in this space E. In practice, each observation is obtained using a sensor that records the signal reaching the point in space where the sensor is located. The recorded signal therefore results from the mixing and propagation in the space E of the signals s_i, and is affected by various disturbances specific to the medium traversed, such as noise, reverberation and interference.

The multichannel capture of a number N of sound sources s_i, propagating in a free field and considered as point sources, is formalized as a matrix operation:

$$x = A * s$$

where x is the vector of the M recorded channels, s the vector of the N sources, A a matrix of dimension MxN, called the "mixing matrix", containing the contributions of each source to each observation, and * denotes linear convolution. Depending on the propagation medium and the antenna format, the matrix A can take different forms. In the case of a coincident antenna (all microphones of the antenna are concentrated at the same point in space) in an anechoic medium, A is a simple matrix of gains. In the case of a non-coincident antenna, in an anechoic or reverberant medium, the matrix A becomes a matrix of filters. In this case, the relation is usually expressed in the frequency domain, x(f) = A s(f), where A is a matrix of complex coefficients.

In the case where the sound signal is captured in an anechoic environment, and if we assume that the number of sources N is smaller than the number of observations M, the analysis of the scene (i.e., the identification of the number of sources and their positions) and its decomposition into objects (i.e., the sources) can easily be performed jointly by an independent component analysis algorithm (ICA, or "ACI" hereinafter). These algorithms make it possible to identify the separation matrix B of size NxM, pseudo-inverse of A, from which the sources are deduced from the observations by the following equation:

$$s = Bx$$

The preliminary step of estimating the dimension of the problem, i.e. the size of the separation matrix and hence the number of sources N, is conventionally done by calculating the rank of the covariance matrix Co of the observations, which is, in this anechoic case, equal to the number of sources:

$$\mathrm{rank}(Co) = N$$

As for the location of the sources, it can be deduced from the encoding matrix A = B^-1 and from knowledge of the spatial properties of the antenna used, in particular the distance between the sensors and their directivities.

Among the best-known ACI algorithms are JADE by J.-F. Cardoso and A. Souloumiac ("Blind beamforming for non-Gaussian signals", IEE Proceedings F - Radar and Signal Processing, Dec. 1993) and Infomax by Amari et al. ("A new learning algorithm for blind signal separation", Advances in Neural Information Processing Systems, 1996).

In practice, under certain conditions, the separation step s = Bx amounts to a constrained formation of channels (or "beamforming" hereafter): the combination of the different channels given by the matrix B applies a spatial filter whose directivity imposes a unit gain in the direction of the source to be extracted and a zero gain in the direction of the interfering sources. An example of beamforming to extract three sources positioned at 0°, 90° and -120° of azimuth respectively is illustrated in FIG. 1. Each of the directivities formed corresponds to the extraction of one of the sources of s.

In the presence of a mixture of sources captured in real conditions, the room effect generates a so-called reverberant sound field, denoted x_r, which is added to the direct fields of the sources:

$$x = A * s + x_r$$

The total acoustic field can be modeled as the sum of the direct field of the sources of interest (labelled 1 in Figure 2), the first reflections (secondary sources, labelled 2 in Figure 2) and a diffuse field (labelled 3 in Figure 2). The covariance matrix of the observations is then of full rank, regardless of the actual number of active sources in the mixture: this means that the rank of Co can no longer be used to estimate the number of sources.
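The rank argument can be checked numerically. The sketch below is only an illustration under simplifying assumptions that are not in the text: an instantaneous mixing matrix of gains (coincident antenna), Laplacian toy sources, and a reverberant field crudely replaced by uncorrelated noise on each sensor. The helper name `covariance_rank` and the tolerance are hypothetical choices.

```python
import numpy as np

def covariance_rank(x, rel_tol=1e-8):
    """Numerical rank of the covariance matrix Co of the observations x (M x T)."""
    co = x @ x.T / x.shape[1]
    return int(np.linalg.matrix_rank(co, tol=rel_tol * np.linalg.norm(co, 2)))

rng = np.random.default_rng(0)
n_sources, n_sensors, n_samples = 2, 4, 5000

# Assumed toy data: independent non-Gaussian sources and a random gain matrix A (M x N).
s = rng.laplace(size=(n_sources, n_samples))
A = rng.normal(size=(n_sensors, n_sources))

# Anechoic capture: x = A s, so Co has rank N.
x_anechoic = A @ s
rank_anechoic = covariance_rank(x_anechoic)

# Crude stand-in for the reverberant field x_r: uncorrelated noise on every sensor.
x_r = 0.1 * rng.normal(size=(n_sensors, n_samples))
rank_reverberant = covariance_rank(x_anechoic + x_r)

# With A known, its pseudo-inverse B recovers the sources in the anechoic case.
# A blind algorithm would have to estimate B from x alone.
B = np.linalg.pinv(A)
s_hat = B @ x_anechoic
```

In this toy setting the anechoic covariance has rank N, while adding the noise field makes it full rank, which is why the rank no longer counts the sources.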

Thus, when a blind source separation (SAS) algorithm is used to separate sources in a reverberant medium, a separation matrix B of size MxM is obtained, generating at the output M components instead of the N desired, the last M-N components essentially containing the reverberated field, by the matrix operation:

$$s = Bx$$

These additional components pose several problems:

- for scene analysis: it is not known a priori which components relate to the sources and which components are induced by the room effect;

- for the separation of sources by beamforming: each additional component induces constraints on the directivities formed and generally degrades the directivity factor, with the consequence of raising the level of reverberation in the extracted signals.

Existing source-counting methods for multichannel content are often based on a sparsity assumption in the time-frequency domain, i.e., for each time-frequency zone, a single source or a limited number of sources has a non-negligible energy contribution. For most of them, a step of locating the most energetic source is performed for each zone (or "bin"), then the zones are aggregated (so-called "clustering" step) to reconstruct the total contribution of each source.

The DUET approach (for "Degenerate Unmixing Estimation Technique"), described for example in the document "Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures" by A. Jourjine, S. Rickard and O. Yilmaz, published in 2000 in ICASSP'00, makes it possible to locate and extract N sources in anechoic conditions from only two non-coincident observations, by assuming that the sources have disjoint frequency supports, i.e.

$$s_i(f)\, s_j(f) = 0$$

for any f as soon as i ≠ j.

After decomposition of the observations into frequency subbands, typically performed via a short-term Fourier transform, an amplitude a_i and a delay t_i are estimated for each subband on the basis of the theoretical mixing equation: in each frequency band f, a pair (a_i, t_i) corresponding to the active source i is estimated.

All the pairs (a_i, t_i) are then represented in the form of a histogram; clustering is performed on the histogram by maximum likelihood, as a function of the position of the zone and the assumed position of the associated source, assuming a Gaussian distribution of the estimated positions of each zone around the actual position of the sources.
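As an illustration of the amplitude and delay estimation, here is a minimal sketch based on the usual DUET ratio of the two mixture spectra. It assumes an anechoic two-sensor model in which, for the single active source in a band, the second observation equals the first one multiplied by a gain and a pure phase delay. This model, the function name `duet_pair_estimates` and the toy signal are assumptions made for illustration, not the exact equations of the cited paper.

```python
import numpy as np

def duet_pair_estimates(x1, x2):
    """Per-bin (amplitude, delay in samples) from the ratio of two mixture spectra."""
    X1 = np.fft.rfft(x1)
    X2 = np.fft.rfft(x2)
    freqs = np.fft.rfftfreq(len(x1))          # in cycles per sample
    # Skip the DC bin (delay undefined) and the Nyquist bin (phase lost in real signals).
    ratio = X2[1:-1] / X1[1:-1]
    amplitude = np.abs(ratio)
    delay = -np.angle(ratio) / (2 * np.pi * freqs[1:-1])
    return amplitude, delay

# Toy check with one source, a relative gain of 0.7 and a delay of half a sample.
rng = np.random.default_rng(0)
n = 1024
source = rng.normal(size=n)
bin_freqs = np.fft.rfftfreq(n)
x1 = source
x2 = np.fft.irfft(0.7 * np.fft.rfft(source) * np.exp(-2j * np.pi * bin_freqs * 0.5), n)
amplitude, delay = duet_pair_estimates(x1, x2)
```

The delay is kept below one sample here so that the phase never wraps; with larger sensor spacings the phase ambiguity would have to be handled separately.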

In practice, the sparsity assumption for the sources in the time-frequency domain is often violated, which is a significant limitation of these approaches for counting sources: the directions of arrival estimated for each zone then result from a combination of the contributions of several sources, and the clustering no longer works properly. In addition, for the analysis of content captured in real conditions, the presence of reverberation can on the one hand degrade the localization of the sources and on the other hand lead to an over-estimation of the number of real sources, when early reflections reach an energy level sufficient to be perceived as secondary sources.

The present invention improves the situation.

To this end, it proposes a method for processing sound data for separating N sound sources from a multichannel sound signal picked up in a real medium. The method is such that it comprises the following steps:

- applying a source separation process to the captured multichannel signal and obtaining a separation matrix and a set of M sound components, with M≥N;

- calculating a set of first descriptors, called bivariate, representative of statistical relations between the components of the pairs of the set of M components obtained;

- calculating a set of second descriptors, called univariate, representative of encoding characteristics of the components of the set of M components obtained;

- classifying the components of the set of M components into two classes of components, a first class of N so-called direct components corresponding to the N direct sound sources and a second class of M-N so-called reverberated components, by calculating a probability of belonging to one of the two classes, as a function of the sets of first and second descriptors.

This method thus makes it possible to discriminate between the components coming from direct sources and the components resulting from reverberation of the sources when the multichannel sound signal is captured in a reverberant medium, that is to say with room effect. The set of first, bivariate descriptors makes it possible to determine whether the components of a pair of the set of components obtained after the source separation step belong to the same class of components or to different classes, while the set of second, univariate descriptors makes it possible to determine, for a single component, which class it is more likely to belong to. This makes it possible to determine the probability that a component belongs to one of the two classes, and thus to determine the N direct sound sources corresponding to the N components classified in the first class.

The various particular embodiments mentioned below may be added, independently or in combination with each other, to the steps of the processing method defined above.

In a particular embodiment, calculating a bivariate descriptor comprises calculating a coherence score between two components. This descriptor calculation makes it possible to know whether a pair of components corresponds to two direct components (2 sources) or if at least one of the components comes from a reverberant effect.

In one embodiment, calculating a bivariate descriptor comprises determining a delay between the two components of the pair. This determination of the delay and of the sign associated with this delay makes it possible to determine, for a pair of components, which component more probably corresponds to the direct signal and which component more probably corresponds to the reverberated signal.

According to a possible implementation of this descriptor calculation, the delay between two components is determined as the delay that maximizes an inter-correlation function between the two components of the pair.

This way of obtaining the delay provides a reliable bivariate descriptor.

In a particular embodiment, the determination of the delay between two components of a pair is associated with an indicator of reliability of the sign of the delay, a function of the coherence between the components of the pair.

In an alternative embodiment, the determination of the delay between two components of a pair is associated with a reliability indicator of the sign of the delay, a function of the ratio of the maxima of an inter-correlation function for delays of opposite sign.

These reliability indicators make it possible to estimate more reliably, for a pair of components belonging to different classes, the probability that each component of the pair is the direct component or the reverberated component.

According to one embodiment, the calculation of a univariate descriptor is a function of a match between the mixing coefficients of a mixing matrix estimated from the source separation step and the encoding characteristics of a plane-wave type source. This descriptor calculation makes it possible, for a single component, to estimate the probability that the component is direct or reverberated.

In one embodiment, the classification of the components of the set of M components takes the whole set of M components into account and computes the most likely combination of classifications of the M components.

In a possible implementation of this global approach, the most likely combination is calculated by determining the maximum of the likelihood values, expressed as the product of the conditional probabilities associated with the descriptors, over the possible combinations of classifications of the M components.
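The global approach can be sketched as an exhaustive search over the labelings of the M components. In the minimal example below, the Gaussian class-conditional laws, their parameters, the descriptor values and every function name are hypothetical placeholders chosen for illustration only; in particular the sketch does not enforce a known number N of direct components.

```python
import itertools
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def p_univariate(value, label):
    # Hypothetical law of a univariate descriptor: closer to 1 for direct ("D") components.
    return gaussian_pdf(value, 0.9 if label == "D" else 0.4, 0.2)

def p_bivariate(value, label_j, label_l):
    # Hypothetical law of a bivariate descriptor: low only for a direct/direct pair.
    mean = 0.2 if (label_j, label_l) == ("D", "D") else 0.6
    return gaussian_pdf(value, mean, 0.15)

def most_likely_combination(univariate, bivariate):
    """Try every labeling in {D, R}^M and keep the one with the largest likelihood."""
    best_labels, best_likelihood = None, -1.0
    for labels in itertools.product("DR", repeat=len(univariate)):
        likelihood = 1.0
        for j, value in enumerate(univariate):
            likelihood *= p_univariate(value, labels[j])
        for (j, l), value in bivariate.items():
            likelihood *= p_bivariate(value, labels[j], labels[l])
        if likelihood > best_likelihood:
            best_labels, best_likelihood = labels, likelihood
    return best_labels, best_likelihood

# Made-up descriptor values for M = 3 components.
labels, likelihood = most_likely_combination(
    univariate=[0.95, 0.85, 0.35],
    bivariate={(0, 1): 0.15, (0, 2): 0.65, (1, 2): 0.60},
)
```

The search visits 2^M labelings, which is what motivates restricting the candidate combinations beforehand when M grows.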

In a particular embodiment, a step of pre-selecting the possible combinations is performed on the basis of the univariate descriptors alone, before the step of calculating the most probable combination.

This thus reduces the likelihood calculations to be performed on the possible combinations since this number of combinations is restricted by this pre-selection step.

In an alternative embodiment, a step of pre-selecting components is performed on the basis of the univariate descriptors alone, before the step of calculating the bivariate descriptors.

Thus, the number of bivariate descriptors to be calculated is limited, which reduces the complexity of the process.

In an exemplary embodiment, the multichannel signal is an ambisonic signal.

The processing method thus described is well suited to this type of signal.

The invention also relates to a sound data processing device implemented to perform a separation processing of N sound sources of a multichannel sound signal picked up by a plurality of sensors in real environment. The device is such that it comprises:

- an input interface for receiving the signals picked up by a plurality of sensors, forming the multichannel sound signal;

- a processing circuit comprising a processor and able to implement:

o a source separation processing module applied to the captured multichannel signal to obtain a separation matrix and a set of M sound components, with M≥N;

o a calculator able to compute a set of first descriptors, called bivariate, representative of statistical relations between the components of the pairs of the set of M components obtained, and a set of second descriptors, called univariate, representative of encoding characteristics of the components of the set of M components obtained;

o a module for classifying the components of the set of M components into two classes of components, a first class of N so-called direct components corresponding to the N direct sound sources and a second class of M-N so-called reverberated components, by calculating a probability of belonging to one of the two classes, as a function of the sets of first and second descriptors;

- an output interface for delivering the classification information of the components.

The invention also applies to a computer program comprising code instructions for implementing the steps of the processing method as described above, when these instructions are executed by a processor and to a storage medium, readable by a processor, on which is recorded a computer program comprising code instructions for performing the steps of the processing method as described.

The device, program and storage medium have the same advantages as the method described above, which they implement.

Other characteristics and advantages of the invention will appear more clearly on reading the following description, given solely by way of non-limiting example, and with reference to the appended drawings, in which:

FIG. 1 illustrates a channel formation for extracting three sources according to a method of source separation of the state of the art as described above;

FIG. 2 illustrates an impulse response with room effect as previously described;

FIG. 3 illustrates, in flowchart form, the main steps of a processing method according to one embodiment of the invention;

FIG. 4 illustrates, as a function of frequency, coherence functions representing bivariate descriptors between two components according to one embodiment of the invention, for different pairs of components;

FIG. 5 illustrates the probability densities of the average coherences representing the bivariate descriptors according to one embodiment of the invention and for different pairs of components and different numbers of sources;

FIG. 6 illustrates inter-correlation functions between two different class components according to one embodiment of the invention and according to the number of sources;

FIG. 7 illustrates the probability densities of a plane wave criterion as a function of the class of the component, of the ambisonic order and of the number of sources, for a particular embodiment of the invention;

FIG. 8 illustrates a hardware representation of a processing device according to one embodiment of the invention, implementing a processing method according to one embodiment of the invention; and

FIG. 9 illustrates an exemplary probability law calculation for a criterion of coherence between a direct component and a reverberated component according to one embodiment of the invention.

FIG. 3 illustrates the main steps of a sound data processing method for separating N sound sources from a multichannel sound signal captured in a real medium, in one embodiment of the invention.

Thus, from a multichannel signal captured by a plurality of sensors placed in a real, that is to say reverberant, medium and delivering a number M of observations from these sensors (x = (x_1, ..., x_M)), the method implements a step E310 of blind separation of sound sources (SAS). In this embodiment, it is assumed that the number of observations is equal to or greater than the number of active sources.

Applying a blind source separation algorithm to the M observations makes it possible, in the case of a reverberant medium, to extract by beamforming M sound components associated with an estimated mixing matrix A of dimension MxM, that is:

s = Bx, with x the vector of the M observations, B the separation matrix of dimension MxM estimated by the blind source separation, and s the vector of the M extracted sound components. Among these are theoretically N sound sources and M-N residual components corresponding to reverberation.

To obtain the separation matrix B, the blind source separation step can be implemented, for example, using an independent component analysis algorithm (or "ACI"), or a principal component analysis algorithm.
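For the principal component analysis option, one way to obtain an MxM matrix B is to diagonalize the covariance of the observations. The sketch below is a generic PCA, not the specific algorithm of the embodiment; the function name `pca_separation_matrix` and the toy observations are assumptions. PCA only decorrelates the components, whereas an independent component analysis would go further and seek statistical independence.

```python
import numpy as np

def pca_separation_matrix(x):
    """Return an M x M matrix B such that the rows of B @ x are uncorrelated."""
    x_centered = x - x.mean(axis=1, keepdims=True)
    covariance = x_centered @ x_centered.T / x.shape[1]
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]          # strongest component first
    return eigenvectors[:, order].T

# Assumed toy observations: M = 4 channels, 2000 samples.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4)) @ rng.laplace(size=(4, 2000))

B = pca_separation_matrix(x)
s = B @ x                                          # s = Bx, M extracted components
component_covariance = np.cov(s)
```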

In an exemplary embodiment, we are interested in ambisonic-type multichannel signals.

Ambisonics consists of projecting the acoustic field onto a basis of spherical harmonic functions, in order to obtain a spatial representation of the sound scene. The basis function is the spherical harmonic of order m and index nσ, which depends on the spherical coordinates (θ, φ) and is defined from a polar function involving the Legendre polynomial.

In practice, a real ambisonic encoding is done from a network of sensors, generally distributed over a sphere. The captured signals are combined to synthesize an ambisonic content whose channels respect the directivity of spherical harmonics. The basic principles of ambisonic encoding are described below.

The ambisonic formalism, initially limited to the representation of spherical harmonic functions of order 1, was later extended to higher orders. The ambisonic formalism with a larger number of components is commonly referred to as "Higher Order Ambisonics" (or "HOA" hereinafter).

To each order m correspond 2m + 1 spherical harmonic functions. Thus, content of order m contains a total of (m + 1)² channels (4 channels at order 1, 9 channels at order 2, 16 channels at order 3, and so on).

Hereinafter, "ambisonic components" is understood to mean the ambisonic signal in each ambisonic channel, by analogy with the "vector components" in a vector basis that would be formed by each spherical harmonic function. For example, we can count:

- one ambisonic component for the order m = 0,

- three ambisonic components for the order m = 1,

- five ambisonic components for the order m = 2,

- seven ambisonic components for the order m = 3, etc.

The captured signals are then distributed over a number M of channels, which is deduced from the maximum order m that is expected to be captured in the sound scene. For example, if a sound scene is captured with an ambisonic microphone with 20 piezoelectric capsules, the maximum ambisonic order captured is m = 3, so that there are no more than 20 channels: the number of ambisonic components considered is 7 + 5 + 3 + 1 = 16 and the number of channels is M = 16, also given by the relation M = (m + 1)², with m = 3.
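The channel arithmetic of this example can be written out as a short sketch. The two function names are hypothetical, and the rule used for the maximum order (the largest m such that (m + 1)² does not exceed the number of capsules) is an assumption drawn from the 20-capsule example above.

```python
import math

def ambisonic_channel_count(order):
    """Sum the 2m + 1 components of each order m from 0 up to `order`."""
    return sum(2 * m + 1 for m in range(order + 1))

def max_order_for_capsules(n_capsules):
    """Largest order m such that (m + 1)^2 channels fit within n_capsules signals."""
    return math.isqrt(n_capsules) - 1

order = max_order_for_capsules(20)          # 20-capsule microphone of the example
channels = ambisonic_channel_count(order)   # 1 + 3 + 5 + 7
```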

Thus, in the example of implementation where the multichannel signal is an ambisonic signal, step E310 receives the signals x = (x_1, ..., x_i, ..., x_M) picked up by a real microphone in a reverberant environment, in the form of frames of ambisonic sound content on M = (m + 1)² channels containing N sources.

The blind separation of sources is therefore performed in step E310 as explained above.

This step makes it possible to extract both the M components and the estimated mixing matrix. The components obtained at the output of the source separation step can be classified into two classes: a first class of so-called direct components, corresponding to the direct sound sources, and a second class of so-called reverberated components, corresponding to the reflections of the sources.

In step E320, descriptors of the M components (s_1, s_2, ..., s_M) resulting from the source separation step are computed; these descriptors make it possible to associate each extracted component with its class: direct component or reverberated component.

Two types of descriptors are computed here: bivariate descriptors, which involve pairs of components (s_j, s_l), and univariate descriptors, calculated for a single component s_j.

Thus, a set of first, bivariate descriptors is calculated. These descriptors are representative of statistical relations between the components of the pairs of the set of M components obtained.

Three situations can arise depending on the respective classes of the two components:

- The two components are direct fields,

- One of the two components is direct and the other is reverberated,

- Both components are reverberated.

According to one embodiment, an average coherence between two components is calculated here. This type of descriptor represents a statistical relationship between the components of a pair and provides an indication of the presence of at least one reverberated component in a pair of components.

Indeed, each direct component consists mainly of the direct field of a source, comparable to a plane wave, plus a residual reverberation whose energy contribution is lower than that of the direct field. Since the sources are statistically independent by nature, there is therefore a weak correlation between the extracted direct components.

Conversely, each reverberated component consists of early reflections, which are delayed and filtered versions of the direct field(s), and of late reverberation. The reverberated components therefore exhibit a significant correlation with the direct components, and usually an identifiable group delay relative to the direct components.

The coherence function informs about the existence of a correlation between two signals s_j and s_l and is expressed by the formula:

$$\gamma_{jl}^2(f) = \frac{|\Gamma_{jl}(f)|^2}{\Gamma_j(f)\,\Gamma_l(f)}$$

where Γ_jl(f) is the interspectrum (cross-spectrum) between s_j and s_l, and Γ_j(f) and Γ_l(f) are the respective autospectra of s_j and s_l.

The coherence is ideally zero when s_j and s_l are the direct fields of independent sources, but it takes a high value when s_j and s_l are two contributions from the same source: the direct field and a first reflection, or two reflections. Such a coherence function therefore gives an indication of whether a pair corresponds to two direct components or to two contributions from the same source (direct / reverberated, or first reflection / later reflections).

In practice, the interspectra and autospectra can be calculated by segmenting the extracted components into K frames (adjacent or overlapping), applying a short-term Fourier transform to each frame k of these K frames to produce the instantaneous spectra S_j(k, f), and averaging the observations over the K frames, for example for the interspectrum Γ_jl(f):

$$\Gamma_{jl}(f) = E_{k \in \{1, \dots, K\}}\{S_j(k, f)\, S_l^*(k, f)\}$$

The descriptor used for a broadband signal is the average over all frequencies of the coherence function γ²_jl(f) between two components s_j and s_l, namely:

$$d_\gamma(s_j, s_l) = E_f\{\gamma_{jl}^2(f)\}$$
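The averaged coherence descriptor can be sketched with a frame-based estimate of the spectra. In the example below, the frame length, the hop size, the Hann window and the toy signals are assumptions chosen for illustration, and `mean_coherence` is a hypothetical name.

```python
import numpy as np

def mean_coherence(sj, sl, frame_len=256, hop=128):
    """Average over frequency of the coherence between two components."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(sj) - frame_len) // hop
    # Instantaneous spectra of each (overlapping) frame.
    Sj = np.array([np.fft.rfft(window * sj[k * hop:k * hop + frame_len]) for k in range(n_frames)])
    Sl = np.array([np.fft.rfft(window * sl[k * hop:k * hop + frame_len]) for k in range(n_frames)])
    # Interspectrum and autospectra, averaged over the frames.
    cross = np.mean(Sj * np.conj(Sl), axis=0)
    auto_j = np.mean(np.abs(Sj) ** 2, axis=0)
    auto_l = np.mean(np.abs(Sl) ** 2, axis=0)
    coherence = np.abs(cross) ** 2 / (auto_j * auto_l + 1e-12)
    return float(np.mean(coherence))

rng = np.random.default_rng(0)
direct_a = rng.normal(size=16384)
direct_b = rng.normal(size=16384)                      # independent second source
# Toy "reverberated" component: delayed, attenuated copy of direct_a plus a residue.
reverberated_a = 0.8 * np.roll(direct_a, 5) + 0.1 * rng.normal(size=16384)

d_direct_direct = mean_coherence(direct_a, direct_b)
d_direct_reverberated = mean_coherence(direct_a, reverberated_a)
```

With these synthetic signals the direct/direct pair yields a value near 0 and the direct/reverberated pair a value near 1; real extracted components would fall in between.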

Since the coherence is bounded between 0 and 1, the average coherence also lies in this range, tending towards 0 for perfectly independent signals and towards 1 for strongly correlated signals. Figure 4 gives an overview of the coherence values as a function of frequency for the following cases:

- Case No. 1, where the coherence values are obtained for two direct components coming from 2 distinct sources.

- Case No. 2, where the coherence values are obtained for a pair of direct and reverberated components, for a single active source.

- Case No. 3, where the coherence values are obtained for a pair of direct and reverberated components, but with two sources active simultaneously.

Note that in the first case, the coherence value d_γ is less than 0.3, while in the second case d_γ reaches 0.7 in the presence of a single active source. These values reflect both the independence of the direct signals and the relationship between a direct signal and the reverberated signal of the same source, in the absence of interference. However, when a second active source is added to the initial mixture (Case No. 3), the coherence of the direct / reverberated pair drops to 0.55 and varies depending on the spectral content and the energy level of the different sources. Here, the competition between the different sources lowers the coherence at low frequencies, while the values are higher above 5500 Hz owing to a smaller contribution from the interfering source.

We therefore note that the determination of the probability that a pair of components belongs to the same class or to different classes may depend on the number of sources assumed a priori to be active. This parameter can be taken into account, in a particular embodiment, for the classification step E340 described later.

In step E330 of FIG. 3, a probability calculation is deduced from the descriptor thus described.

In practice, the probability densities of FIGS. 5 and 7 described below, and more generally all the probability densities of the descriptors, are learned statistically on databases covering various acoustic conditions (reverberant / dry) and various sources (male / female voice, French / English languages, etc.). The components are classified in an informed manner: each source is associated with the spatially closest extracted component, the remaining components being classified as reverberated. To calculate the position of a component, the first 4 coefficients (i.e., order 1) of its mixing vector are used, taken from the matrix A, which is the inverse of the separation matrix B. Assuming that this vector follows the encoding rule of a plane wave whose spherical coordinates (θ, φ) are the azimuth and elevation of the source, the position of the extracted component can be deduced by simple trigonometric computation, using arctan2, the two-argument arctangent function that removes the sign ambiguity of the arctangent function.
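The position computation can be illustrated as follows. This is a minimal sketch that assumes a common first-order convention in which a plane wave from azimuth θ and elevation φ is encoded as (1, cos θ cos φ, sin θ cos φ, sin φ); the channel order, the normalization, the division by the first coefficient and both function names are assumptions, not details confirmed by the text.

```python
import numpy as np

def encode_plane_wave(azimuth, elevation):
    """First-order encoding gains of a plane wave (assumed channel convention)."""
    return np.array([
        1.0,
        np.cos(azimuth) * np.cos(elevation),
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
    ])

def component_direction(mixing_vector):
    """Azimuth and elevation (radians) from the first 4 mixing coefficients."""
    a = np.asarray(mixing_vector[:4], dtype=float)
    a = a / a[0]   # assumption: removes the gain and sign indeterminacy of the separation
    azimuth = np.arctan2(a[2], a[1])
    elevation = np.arctan2(a[3], np.hypot(a[1], a[2]))
    return azimuth, elevation

# Round trip on a made-up direction, with an arbitrary scale and sign flip.
true_azimuth, true_elevation = np.radians(120.0), np.radians(30.0)
column = -2.5 * encode_plane_wave(true_azimuth, true_elevation)
azimuth, elevation = component_direction(column)
```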

Once the signals are classified, the different descriptors are calculated. From the point cloud obtained on the database for a given class, a histogram of the values of the descriptor is extracted; a probability density is then selected from a collection of probability densities on the basis of a distance, generally the Kullback-Leibler divergence. FIG. 9 shows an example of law calculation for the criterion of coherence between a direct component and a reverberated component: the lognormal law has been selected from among ten laws because it minimizes the Kullback-Leibler divergence.
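The selection of a law by minimum Kullback-Leibler divergence can be sketched as below. The three candidate laws, their moment-based fits, the number of bins and the synthetic samples are all assumptions for illustration; the text mentions a collection of ten laws that is not listed, and the samples here are not measured descriptor values.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence between two histograms."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def candidate_pdfs(samples):
    """Hypothetical collection of laws, each fitted to the samples."""
    mu, sigma = samples.mean(), samples.std()
    log_mu, log_sigma = np.log(samples).mean(), np.log(samples).std()
    return {
        "normal": lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)),
        "lognormal": lambda x: np.exp(-0.5 * ((np.log(x) - log_mu) / log_sigma) ** 2)
        / (x * log_sigma * np.sqrt(2 * np.pi)),
        "exponential": lambda x: np.exp(-x / mu) / mu,
    }

def select_law(samples, bins=40):
    """Return the candidate law minimizing the divergence to the histogram."""
    counts, edges = np.histogram(samples, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    scores = {
        name: kl_divergence(counts.astype(float), pdf(centers))
        for name, pdf in candidate_pdfs(samples).items()
    }
    return min(scores, key=scores.get), scores

# Synthetic positive descriptor values standing in for a database point cloud.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=-0.7, sigma=0.5, size=20000)
best_law, scores = select_law(samples)
```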

For the example of an ambisonic signal, FIG. 5 represents the distributions (probability density or pdf for "Probability density function") associated with the value of the average coherence between two components.

The probability laws shown here are for a capture with an ambisonic microphone with 4 channels (first-order ambisonics) or 9 channels (second-order ambisonics), with one or two sources active simultaneously. We first observe that the average coherence d_γ takes significantly lower values for pairs of direct components than when at least one of the components is reverberated, and this observation is all the more marked as the ambisonic order is high. This is due to the better selectivity of beamforming when the number of channels is larger, and therefore to a better separation of the extracted components.

It can also be seen that, in the presence of two active sources, the coherence estimators are degraded, both for the direct / reverberated and for the reverberated / reverberated pairs (in the presence of a single source, the direct / direct pair does not exist).

The probability densities therefore depend on the number of sources in the mix and on the number of sensors available.

This descriptor is therefore relevant for detecting whether a pair of extracted components corresponds to two direct components (2 true sources) or if at least one of the two components originates from the room effect.

In one embodiment of the invention, another type of bivariate descriptor is calculated in step E320. This descriptor is calculated either in place of the coherence-type descriptor described above, or in addition to it.

This descriptor makes it possible to determine, for a (direct / reverberated) pair, which component is more likely the direct signal and which corresponds to the reverberated signal, based on the simple assumption that the first reflections are delayed and attenuated versions of the direct signal.

This descriptor is based on another statistical relationship between the components: the delay between the two components of the pair. The delay τ_jl,max is defined as the delay that maximizes the inter-correlation function r_jl(τ) = E{s_j(t) s_l(t - τ)} between the components s_j and s_l of a pair of components:

$$\tau_{jl,\max} = \arg\max_{\tau} |r_{jl}(\tau)|$$

When s_j is a direct signal and s_l an associated reflection, the plot of the inter-correlation function will usually show a negative τ_jl,max.
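The delay descriptor can be sketched as a search for the lag that maximizes the magnitude of the cross-correlation (the inter-correlation of the text). The sketch assumes the convention r_jl(τ) = E{s_j(t) s_l(t - τ)}, under which a reflection lagging behind the direct signal gives a negative lag; the function names, the lag range and the toy signals are assumptions.

```python
import numpy as np

def inter_correlation(sj, sl, max_lag):
    """Estimate r_jl(tau) = E{s_j(t) s_l(t - tau)} for tau in [-max_lag, max_lag]."""
    n = len(sj)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.array([
        np.mean(sj[max(0, tau):n + min(0, tau)] * sl[max(0, -tau):n + min(0, -tau)])
        for tau in lags
    ])
    return lags, r

def delay_descriptor(sj, sl, max_lag):
    """Lag (in samples) maximizing the magnitude of the inter-correlation."""
    lags, r = inter_correlation(sj, sl, max_lag)
    return int(lags[np.argmax(np.abs(r))])

rng = np.random.default_rng(0)
direct = rng.normal(size=8000)
# Toy reflection: the direct signal delayed by 7 samples and attenuated.
reflection = 0.6 * np.concatenate([np.zeros(7), direct[:-7]])

tau_direct_first = delay_descriptor(direct, reflection, max_lag=50)
tau_reflection_first = delay_descriptor(reflection, direct, max_lag=50)
```

Swapping the two components flips the sign of the lag, which is the property the classification relies on.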

Thus, if we know that we are in the presence of a pair of direct / reverberated components, we can theoretically assign its class to each of the components thanks to the sign of the delay τ_jl,max.

In practice, the estimate of the sign of the delay τ_jl,max is often very noisy, and sometimes even reversed:

- When the scene consists of a single source, a group delay does not necessarily emerge distinctly if the reverberated field is mainly late. Moreover, the direct components extracted by SAS always contain a more or less significant residue of room effect, which adds noise to the measurement of the delay.

- When several sources are present, the interference disturbs the measurement, all the more so if the analysis frames are short and the direct fields have not all been perfectly separated.

For these reasons, the sign of the delay used as a descriptor can be made more reliable by means of an indicator of robustness or reliability.

The average coherence between the components makes it possible to evaluate the relevance of the direct/reverberated pair, as seen previously. If it is strong, we can expect the group delay to be a reliable descriptor.

On the other hand, the relative value of the cross-correlation peak at the delay, compared with the other values of the cross-correlation function, also provides information about the reliability of the group delay. Figure 6 illustrates the emergence of the cross-correlation peak between a direct component and a reverberated component. In the upper part (1) of Fig. 6, where only one source is present, the cross-correlation maximum clearly emerges from the rest of the cross-correlation, reliably indicating that one of the components lags behind the other. It emerges in particular with respect to the values of the cross-correlation function for lags τ of sign opposite to that of the delay (the positive τ in Figure 6), which are very small whatever the value of τ.

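In a particular embodiment, a second indicator of the reliability of the sign of the delay, called emergence, is defined by calculating the ratio between the absolute value of the cross-correlation $r_{jl}(\tau)$ of the components $s_j$ and $s_l$ at the delay $\tau_{jl,max}$ that maximizes it and the maximum absolute value of the cross-correlation for lags $\tau$ of sign opposite to that of $\tau_{jl,max}$:

$$emergence_{jl} = \frac{|r_{jl}(\tau_{jl,max})|}{|r_{jl}(\tau'_{jl,max})|}$$

where $\tau'_{jl,max}$ is defined by:

$$\tau'_{jl,max} = \arg\max_{\tau \,:\, \mathrm{sign}(\tau) \neq \mathrm{sign}(\tau_{jl,max})} |r_{jl}(\tau)|$$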

This ratio, called emergence, is an ad hoc criterion whose relevance is verified in practice: it takes values close to 1 for independent signals, i.e. two direct components, and higher values for correlated signals such as a direct component and a reverberated component. In the aforementioned case of curve (1) of FIG. 6, the emergence value is 4.
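The emergence ratio can be sketched as follows, starting from a sampled cross-correlation function. The function name `emergence` and the handling of the two edge cases (a maximum located at lag 0, and a zero denominator) are assumptions of this sketch and are not specified in the description.

```python
import numpy as np

def emergence(lags, r):
    """Ratio between the cross-correlation peak and the largest absolute value
    found on lags of the opposite sign (hypothetical helper).

    lags: 1-D array of lags; r: cross-correlation values at these lags.
    """
    lags = np.asarray(lags)
    mag = np.abs(np.asarray(r, dtype=float))
    k_max = int(np.argmax(mag))
    tau_max = lags[k_max]
    if tau_max < 0:
        opposite = lags > 0
    elif tau_max > 0:
        opposite = lags < 0
    else:
        # Assumption: with a peak at lag 0 the sign is undefined,
        # so no emergence is reported.
        return 1.0
    if not np.any(opposite):
        # No lag of the opposite sign is available in the analysed range.
        return float("inf")
    denom = mag[opposite].max()
    # Assumption: a zero denominator is reported as an infinite emergence.
    return float(mag[k_max] / denom) if denom > 0 else float("inf")
```

For example, a peak of 0.8 at a negative lag and a largest value of 0.2 over the positive lags give an emergence of 4, the order of magnitude reported for curve (1) of FIG. 6.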

We thus have a descriptor which determines, for each pair assumed to be direct/reverberated, the probability for each component of the pair of being the direct component or the reverberated component. This descriptor is a function of the sign of the delay, of the average coherence between the components, and of the emergence of the cross-correlation maximum.

It should be noted that this descriptor is sensitive to noise, and in particular to the presence of several simultaneous sources, as illustrated by curve (2) of FIG. 6: in the presence of two sources, even if the correlation maximum still emerges, its relative value (2.6) is lower because of the presence of an interfering source, which reduces the correlation between the extracted components. In a particular embodiment, the reliability of the sign of the delay is measured as a function of the value of the emergence, weighted by the a priori number of sources to be detected.

With this descriptor, a probability of belonging to a first class of direct components or to a second class of reverberated components is calculated for a pair of components in step E330. For $s_j$ identified as being in advance of $s_l$, the probability that $s_j$ is direct and $s_l$ reverberated is estimated by a two-dimensional law.

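Logically, the probability that $s_j$ is reverberated and $s_l$ direct, even though $s_j$ is in phase advance, is then estimated as the complement to 1 of the direct/reverberated case:

$$p(C_j = C^r, C_l = C^d \mid d) = 1 - p(C_j = C^d, C_l = C^r \mid d)$$

where $d$ denotes the value of the descriptor and $C_j$ and $C_l$ are the respective classes of the components $s_j$ and $s_l$, $C^d$ being the first class of so-called direct components corresponding to the N direct sound sources and $C^r$ the second class of M-N so-called reverberated components.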

This descriptor can only be used for direct/reverberated pairs. Direct/direct and reverberated/reverberated pairs are not concerned by this descriptor, so they are considered equiprobable:

$$p(C_j = C^d, C_l = C^d) = p(C_j = C^r, C_l = C^r) = 0.5$$

The sign of the delay is a reliable indicator when both coherence and emergence have medium or high values. A weak emergence or a weak coherence makes the direct/reverberated and reverberated/direct pairs equiprobable.

In step E320, a set of second, univariate descriptors representative of encoding characteristics of the components of the set of M components obtained is also calculated.

Knowing the capture system used, the encoding of a source coming from a given direction is done with mixing coefficients that depend, among other things, on the directivity of the sensors. When the source can be considered as a point source and the wavelengths are large compared with the size of the antenna, the source can be considered as a plane wave. This assumption generally holds for a small ambisonic microphone, provided that the source is sufficiently far from the microphone (in practice, one meter is enough).

For a component $s_j$ extracted by SAS, the j-th column of the estimated mixing matrix A, obtained by inverting the separation matrix B, contains the mixing coefficients associated with it. If this component is direct, that is to say if it corresponds to a single source, the mixing coefficients of the column $A_j$ will tend towards the characteristics of the microphone encoding for a plane wave. In the case of a reverberated component, which is the sum of several reflections and a diffuse field, the estimated mixing coefficients will be more random and will not correspond to the encoding of a single source with a precise direction of arrival.

One can therefore use the conformity between the estimated mixing coefficients and the theoretical mixing coefficients for a single source to estimate a probability that the component is direct or reverberated.

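In the case of a first-order ambisonic microphone capture, the encoding of a plane wave $s_j$ of incidence $(\theta_j, \varphi_j)$ (azimuth, elevation) in the ambisonic format known as N3D is carried out according to the formula:

$$x_j = A_j\, s_j$$

where

$$A_j = \begin{bmatrix} 1 \\ \sqrt{3}\cos\theta_j\cos\varphi_j \\ \sqrt{3}\sin\theta_j\cos\varphi_j \\ \sqrt{3}\sin\varphi_j \end{bmatrix}$$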

There are indeed several ambisonic formats, which differ in particular in the normalization of the various components grouped by order. Here we consider the well-known N3D format. The different formats are described, for example, at the following link: https://en.wikipedia.org/wiki/Ambisonic_data_exchange_formats.

It is thus possible to deduce from the encoding coefficients of a source a criterion, called the plane wave criterion, which measures the conformity between the estimated mixing coefficients and the theoretical equation of a single encoded plane wave:

$$c_{op} = \frac{a_{2j}^2 + a_{3j}^2 + a_{4j}^2}{3\, a_{1j}^2}$$

where $a_{1j}, \dots, a_{4j}$ are the estimated mixing coefficients of the component $s_j$, that is, the entries of the column $A_j$ of the estimated mixing matrix.

The criterion $c_{op}$ is by definition equal to 1 in the case of a plane wave. In the presence of a correctly identified direct field, the plane wave criterion remains very close to 1. Conversely, in the case of a reverberated component, the multitude of contributions (early and late reflections), each with its own energy level, generally moves the plane wave criterion away from its ideal value.
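The plane wave criterion can be illustrated with the short sketch below. It rests on several assumptions that are not confirmed by the description: the first-order channels are ordered omnidirectional component first and then the three directional components, the criterion is taken as the energy of the three directional coefficients divided by three times the energy of the omnidirectional coefficient (a form chosen because it equals 1 for an N3D plane wave), and the function names are hypothetical.

```python
import numpy as np

def n3d_plane_wave(azimuth, elevation):
    """First-order N3D encoding vector of a plane wave (angles in radians).

    Assumed channel order: omnidirectional, then the three directional channels.
    """
    return np.array([
        1.0,
        np.sqrt(3.0) * np.cos(azimuth) * np.cos(elevation),
        np.sqrt(3.0) * np.sin(azimuth) * np.cos(elevation),
        np.sqrt(3.0) * np.sin(elevation),
    ])

def plane_wave_criterion(a):
    """Assumed criterion: directional energy over 3 times omnidirectional energy."""
    a = np.asarray(a, dtype=float)
    return float(np.sum(a[1:4] ** 2) / (3.0 * a[0] ** 2))

def criteria_from_separation(B):
    """Criterion of each component, from the separation matrix B.

    The estimated mixing matrix A is obtained by (pseudo-)inverting B;
    column j of A holds the mixing coefficients of component j.
    """
    A = np.linalg.pinv(np.asarray(B, dtype=float))
    return np.array([plane_wave_criterion(A[:, j]) for j in range(A.shape[1])])
```

Because the criterion is a ratio of energies, it does not depend on the gain of the extracted component, which is convenient since source separation recovers components only up to a scale factor.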

For this descriptor, as for the others, the associated distribution calculated in E330 exhibits a certain variability, depending in particular on the level of noise present in the extracted components. This noise consists mainly of residual reverberation and of contributions from interfering sources that have not been perfectly canceled. To refine the analysis, one can thus choose to estimate the distribution of the descriptors according to:

- the number of channels used (and therefore, here, the ambisonic order), which influences the selectivity of the beamforming and therefore the residual noise level;

- the number of sources contained in the mixture (as for the previous descriptors), the increase of which mechanically raises the noise level and the variance of the estimate of the separation matrix B, and hence of A.

The probability laws (probability density) associated with this descriptor can be observed in FIG. 7, as a function of the number of active sources simultaneously (1 or 2) and of the ambisonic order of the content analyzed (orders 1 to 2). According to the initial hypothesis, the value of the plane wave criterion is concentrated around the value 1 for the direct components. For reverberated components, the distribution is more uniform, but with a slightly asymmetrical shape, because of the descriptor itself, which is asymmetric, with a 1 / x form.

The distance between the distributions of the two classes allows a fairly reliable discrimination between plane-wave-type components and more diffuse ones.

Thus, the descriptors calculated in step E320 and described here are based both on the statistics of the extracted components (average coherence and group delay) and on the estimated mixing matrix (plane wave criterion). They make it possible to determine conditional probabilities that a component belongs to one of the two classes $C^d$ or $C^r$.

These probabilities are then used in step E340 to determine a classification of the components of the set of M components according to the two classes.

For a component $s_j$, we denote by $C_j$ the corresponding class. In order to classify the set of M extracted components, we call "configuration" the vector of classes C, of dimension 1×M, such that:

$$C = [C_1, C_2, \dots, C_M], \quad C_j \in \{C^d, C^r\}$$

Knowing that there are two possible classes for each component, the problem ultimately amounts to choosing among a total of $2^M$ potential configurations, assumed equiprobable. To do this, the maximum a posteriori rule is applied: knowing the likelihood $L(C)$ of each configuration, the configuration chosen is the one with the maximum likelihood, i.e.:

$$\hat{C} = \arg\max_{C} L(C)$$

The chosen approach can be exhaustive and then consists in estimating the likelihood of all the possible configurations, from the descriptors determined in step E320 and the distributions associated with them which are calculated in step E330.

According to another approach, a pre-selection of the configurations can be performed to reduce the number of configurations to be tested, and therefore the complexity of implementing the solution. This pre-selection can be done, for example, using the plane wave criterion alone, by classifying certain components in the category $C^r$ when the value of their criterion is far from 1, the theoretical value for a plane wave. In the case of ambisonic signals, the distributions of Figure 7 show that, whatever the configuration (order or number of sources) and a priori without loss of robustness, the components whose criterion $c_{op}$ satisfies one of the following inequalities can be classified in the category $C^r$:

Pre-classifying certain components in this way reduces the number of configurations to test, by excluding the configurations that impose the class $C^d$ on these pre-classified components.

Another possibility for further reducing the complexity is to exclude the pre-classified components from the computation of the bivariate descriptors and from the likelihood calculation, which reduces the number of bivariate criteria to be calculated and therefore further reduces the processing complexity.
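The pre-selection described above can be sketched as follows. The thresholds `low` and `high` are purely hypothetical placeholders, not values taken from the description, and the function name and the "d" / "r" labels for the direct and reverberated classes are likewise assumptions of this sketch.

```python
from itertools import product

def candidate_configurations(c_op, low=0.5, high=2.0):
    """Enumerate the configurations left after pre-classification.

    c_op: plane wave criterion of each of the M components.
    low, high: hypothetical thresholds; a component whose criterion falls
    outside [low, high] is pre-classified as reverberated ("r").
    """
    pre_reverberated = [not (low <= c <= high) for c in c_op]
    free = [j for j, fixed in enumerate(pre_reverberated) if not fixed]
    configurations = []
    # Only the components that are not pre-classified may take either class.
    for labels in product("dr", repeat=len(free)):
        config = ["r"] * len(c_op)
        for j, label in zip(free, labels):
            config[j] = label
        configurations.append(tuple(config))
    return configurations
```

With M components of which P are pre-classified, the number of configurations to evaluate drops from 2^M to 2^(M-P).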

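To estimate the likelihood of each configuration using the calculated descriptors, a naive Bayesian approach can be used. In this type of approach, a set of descriptors $d_k$ is available for each component $s_j$. For each descriptor, the probability that the component $s_j$ belongs to the class $C^\alpha$ ($\alpha = d$ or $r$) is formulated using Bayes' law:

$$p(C_j = C^\alpha \mid d_k) = \frac{p(d_k \mid C_j = C^\alpha)\, p(C_j = C^\alpha)}{p(d_k)}$$

Both classes being assumed equiprobable, it follows that

$$p(C_j = C^\alpha) = \frac{1}{2}$$

as well as

$$p(d_k) = \frac{1}{2}\left[ p(d_k \mid C_j = C^d) + p(d_k \mid C_j = C^r) \right]$$

We then obtain:

$$p(C^\alpha \mid d_k) = \frac{p(d_k \mid C^\alpha)}{p(d_k \mid C^d) + p(d_k \mid C^r)}$$

where, to lighten the notation, the term $C_j = C^\alpha$ is abbreviated $C^\alpha$. When searching for the maximum likelihood, the denominator of each conditional probability is constant regardless of the configuration evaluated. The expression can therefore be simplified:

$$p(C^\alpha \mid d_k) \propto p(d_k \mid C^\alpha)$$

For a bivariate descriptor (such as the coherence), which involves two components $s_j$ and $s_l$ and their respective supposed classes, the preceding expression is extended:

$$p(C_j = C^\alpha, C_l = C^\beta \mid d_k) \propto p(d_k \mid C_j = C^\alpha, C_l = C^\beta)$$

And so on.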

The likelihood is expressed as the product of the conditional probabilities associated with each of the K descriptors, assuming that they are independent:

$$L(C) = p(d \mid C) = \prod_{k=1}^{K} p(d_k \mid C)$$

where $d$ is the vector of the descriptors and $C$ is a vector representing a configuration (i.e. the combination of the supposed classes of the M components), as defined above.

More precisely, a number $K_1$ of univariate descriptors is used for each of the components, while a number $K_2$ of bivariate descriptors is used for each pair of components. Since the probability laws of the descriptors are established according to the supposed number of sources and the number of channels (the index m represents the ambisonic order, in the case of a capture of this type), the final expression of the likelihood is formulated as:

$$L(C) = \prod_{j=1}^{M} \prod_{k=1}^{K_1} p\big(d_k(j) \mid C_j, N, m\big) \times \prod_{j=1}^{M-1} \prod_{l=j+1}^{M} \prod_{k=1}^{K_2} p\big(d_k(j,l) \mid C_j, C_l, N, m\big)$$

where

- $d_k(j)$ is the value of the descriptor of index k for the component $s_j$;

- $d_k(j,l)$ is the value of the bivariate descriptor of index k for the components $s_j$ and $s_l$;

- $C_j$ and $C_l$ are the supposed classes of the components j and l;

- N is the number of active sources associated with the evaluated configuration:

$$N = \mathrm{card}\{\, j : C_j = C^d \,\}$$

For computational reasons, the likelihood is replaced by its logarithmic version (log-likelihood):

$$LL(C) = \sum_{j=1}^{M} \sum_{k=1}^{K_1} \log p\big(d_k(j) \mid C_j, N, m\big) + \sum_{j=1}^{M-1} \sum_{l=j+1}^{M} \sum_{k=1}^{K_2} \log p\big(d_k(j,l) \mid C_j, C_l, N, m\big)$$

This equation is the one ultimately used to determine the most likely configuration in the Bayesian classifier described here for this embodiment.
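The exhaustive search for the most likely configuration can be sketched as follows, with one univariate descriptor (the plane wave criterion) and one bivariate descriptor (the average coherence). The Gaussian densities and all their parameters are hypothetical stand-ins for the probability laws of step E330, which in the description are established from data; for brevity this sketch also ignores the dependence of those laws on the number of sources N and on the ambisonic order m.

```python
import math
from itertools import product

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def univariate_pdf(c_op, cls):
    # Hypothetical laws: concentrated around 1 for direct ("d") components,
    # much more spread out for reverberated ("r") components.
    return gaussian_pdf(c_op, 1.0, 0.1 if cls == "d" else 1.0)

def bivariate_pdf(coherence, cls_j, cls_l):
    # Hypothetical laws: low coherence for direct/direct pairs, higher otherwise.
    mean = 0.1 if (cls_j == "d" and cls_l == "d") else 0.5
    return gaussian_pdf(coherence, mean, 0.15)

def log_likelihood(config, c_op, coherence):
    """Sum of log-densities over all components and all pairs j < l."""
    total = 0.0
    m_comp = len(config)
    for j in range(m_comp):
        # max(..., 1e-300) guards against log(0) when a density underflows.
        total += math.log(max(univariate_pdf(c_op[j], config[j]), 1e-300))
        for l in range(j + 1, m_comp):
            p = bivariate_pdf(coherence[j][l], config[j], config[l])
            total += math.log(max(p, 1e-300))
    return total

def classify(c_op, coherence):
    """Return the configuration with the highest log-likelihood among 2^M."""
    candidates = product("dr", repeat=len(c_op))
    return max(candidates, key=lambda cfg: log_likelihood(cfg, c_op, coherence))
```

The number of "d" labels in the returned configuration then gives the number of components retained as direct sources. Working with sums of logarithms rather than products of densities avoids numerical underflow when many descriptors are combined.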

The Bayesian classifier presented here is only one example of implementation; it could be replaced, among others, by a support vector machine or a neural network.

Finally, the configuration presenting the maximum likelihood is retained, indicating the direct or reverberant class associated with each of the M components.

From this combination, the N components corresponding to the N active direct sources are deduced.

The processing described here is performed in the time domain but may also, in an alternative embodiment, be applied in a transformed domain. The method as described with reference to FIG. 3 is then implemented by frequency subbands, after the captured signals have been transformed into that domain.

Moreover, the useful bandwidth can be reduced according to the potential imperfections of the capture system, at high frequencies (presence of spatial aliasing) or at low frequencies (impossibility of recovering the theoretical directivities of the microphone encoding).

FIG. 8 represents here an embodiment of a processing device (DIS) according to an embodiment of the invention.

Sensors, represented here in the form of a spherical microphone MIC, make it possible to acquire, in a real and therefore reverberant medium, M mixing signals from a multichannel signal.

Of course, other forms of microphones or sensors can be provided. These sensors can be integrated in the DIS device or located outside the device; in the latter case, the resulting signals are transmitted to the processing device, which receives them via its input interface 840. Alternatively, these signals can simply be obtained beforehand and imported into the memory of the DIS device.

These M signals are then processed by a processing circuit and computing means such as a processor PROC 860 and a working memory MEM 870. This memory may include a computer program comprising code instructions for implementing the steps of the processing method as described, for example, with reference to Figure 3, and in particular the steps of: applying a source separation process to the captured multichannel signal and obtaining a set of M sound components, with M≥N; calculating a set of first so-called bivariate descriptors, representative of statistical relations between the components of the pairs of the set of M components obtained, and a set of second so-called univariate descriptors, representative of encoding characteristics of the components of the set of M components obtained; and classifying the components of the set of M components according to two classes of components, a first class of N so-called direct components corresponding to the N direct sound sources and a second class of M-N so-called reverberated components, by a calculation of the probability of belonging to one of the two classes, as a function of the sets of first and second descriptors.

Thus, the device comprises a source separation processing module 810 applied to the captured multichannel signal to obtain a set of M sound components, with M≥N. The M components are provided at the input of a calculator 820 capable of calculating a set of first so-called bivariate descriptors, representative of statistical relations between the components of the pairs of the set of M components obtained, and a set of second so-called univariate descriptors, representative of encoding characteristics of the components of the set of M components obtained.

These descriptors are used by a classification module 830, or classifier, able to classify the components of the set of M components according to two classes of components, a first class of N so-called direct components corresponding to the N direct sound sources and a second class of M-N so-called reverberated components.

For this purpose, the classification module comprises a module 831 for calculating the probability of belonging to one of the two classes of the components of the set M, which is a function of the sets of first and second descriptors.

The classifier uses descriptors related to the correlation between the components to determine which are direct signals (ie true sources) and which are reverb residues. It also uses descriptors related to SAS-estimated mixing coefficients, to evaluate the conformity between the theoretical encoding of a single source and the estimated encoding of each component. Some of the descriptors are therefore a function of a pair of components (for the correlation), and others are functions of a single component (for the conformity of the estimated microphonic encoding).

A likelihood calculation module 832 makes it possible to determine, in one embodiment, the most probable combination of the classifications of the M components by a calculation of likelihood values according to the probabilities calculated in module 831 and for the possible combinations.

Finally, the device comprises an output interface 850 for outputting the classification information of the components, for example to another processing device that can use this information to enhance the sound of the discriminated sources, to denoise them or to perform a mixing from several discriminated sources. Another possible processing operation may also be to analyze or locate the sources in order to optimize the processing of a voice command.

Many other applications using the classification information thus determined are then possible.

The device can be used, for example, for capturing sound scenes or for voice command sound recording. The device can also be integrated in a communication terminal capable of processing signals picked up by a plurality of sensors integrated in or remote from the terminal.

Claims

1. A method of processing sound data for a separation of N sound sources of a multichannel sound signal captured in real environment, characterized in that it comprises the following steps:

- applying (E310) a source separation process to the captured multichannel signal and obtaining a separation matrix and a set of M sound components, with M≥N;

- calculating (E320) a set of first so-called bivariate descriptors, representative of statistical relations between the components of the pairs of the set of M components obtained;

- calculating (E320) a set of second so-called univariate descriptors, representative of encoding characteristics of the components of the set of M components obtained;

- classifying (E340) the components of the set of M components according to two classes of components, a first class of N so-called direct components corresponding to the N direct sound sources and a second class of M-N so-called reverberated components, by a calculation (E330) of the probability of belonging to one of the two classes, as a function of the sets of first and second descriptors.

2. Method according to claim 1, wherein the calculation of a bivariate descriptor comprises calculating a coherence score between two components.

3. Method according to one of claims 1 to 2, wherein the calculation of a bivariate descriptor comprises determining a delay between the two components of the pair.

4. Method according to claim 3, wherein the delay between the two components is determined by taking into account the delay maximizing a cross-correlation function between the two components of the pair.

5. Method according to one of claims 3 or 4, wherein the determination of the delay between two components of a pair is associated with an indicator of the reliability of the sign of the delay, which is a function of the coherence between the components of the pair.

6. Method according to one of claims 3 or 5, wherein the determination of the delay between two components of a pair is associated with an indicator of the reliability of the sign of the delay, which is a function of the ratio of the maxima of a cross-correlation function for delays of opposite sign.

7. Method according to one of claims 1 to 6, wherein the calculation of a univariate descriptor is a function of a matching between mixing coefficients of a mixing matrix estimated from the source separation step and encoding characteristics of a plane wave source.

8. Method according to one of claims 1 to 7, wherein the classification of the components of the set of M components is performed by taking into account all the M components, and by calculating the most likely combination of the classifications of the M components.

9. The method of claim 8, wherein the calculation of the most likely combination is made by determining a maximum of the likelihood values, expressed as the product of the conditional probabilities associated with the descriptors, for the possible combinations of classification of the M components.

10. The method of claim 8, wherein a step of preselecting the possible combinations is performed based on the univariate descriptors only, before the step of calculating the most likely combination.

11. The method as claimed in one of the preceding claims, in which a step of pre-selecting the components is performed based on the univariate descriptors only, before the step of calculating the bivariate descriptors.

12. Method according to one of the preceding claims, wherein the multichannel signal is an ambisonic signal.

13. A sound data processing device implemented for performing a separation processing of N sound sources of a multichannel sound signal picked up by a plurality of sensors in real environment, characterized in that it comprises:

an input interface for receiving the signals picked up by a plurality of sensors, of the multichannel sound signal;

a processing circuit comprising a processor and able to control:

a source separation processing module applied to the multichannel signal picked up to obtain a separation matrix and a set of M sound components, with M≥N;

a calculator capable of calculating a set of first so-called bivariate descriptors, representative of statistical relations between the components of the pairs of the set of M components obtained, and a set of second so-called univariate descriptors, representative of encoding characteristics of the components of the set of M components obtained;

an output interface for delivering the classification information of the components.

14. Computer program comprising code instructions for implementing the steps of the processing method according to one of claims 1 to 12, when these instructions are executed by a processor.

15. A storage medium, readable by a processor, on which is stored a computer program comprising code instructions for performing the steps of the processing method according to one of claims 1 to 12.