EP2452293A1

EP2452293A1 - Source location

Info

Publication number: EP2452293A1
Application number: EP10751985A
Authority: EP
Inventors: Zaher El Chami; Alexandre Guerin
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2009-07-10
Filing date: 2010-07-08
Publication date: 2012-05-16
Also published as: FR2947931A1; WO2011012789A1

Abstract

According to the invention, the direction of arrival of a first number of signals, emitted respectively by the same first number of sources, is determined within a space during a time range that is subdivided into frames. Two observations of said signals are accessed over the time range. For each frame, a first probability of the presence of one of the sources is calculated, on the basis of the observations, in each direction of a set of spatial directions. For a second number of frames from a time window, a second probability of the presence of one of the sources is calculated, on the basis of the first probability, in each direction of a set of spatial directions. The directions for which there is a local maximum of the second probability are then sought, each of said directions corresponding to the direction of arrival of one of the signals.

Description

LOCATION OF SOURCES

The present invention relates generally to the automatic counting of the number of sources present in a mixture, and to the determination of the direction of arrival of signals emitted by a plurality of sources, and more particularly in the case where the number of sources is unknown a priori.

The invention finds applications, in particular, in the production of multitrack audio coders/decoders, in particular those of the "MPEG Surround" type, for creating, from a stereo audio track, an audio track comprising more than two channels, and/or for modifying the spatial characteristics by simulating a displacement of the different sources that emitted the recorded signals. It also applies to automatic source separation to reduce noise, echo and interference from noise sources, particularly in the case of group audio conferencing. Another application is the localization, in a stereo track, of the direction of arrival of certain specific sources in order to remove them, for example the sources corresponding to the voices, so as to produce a musical karaoke track from the original track.

In a space E in which a number N of sources each emit a signal sᵢ, the blind separation of said sources consists of estimating the direction of arrival of the signals sᵢ and the number N of sources from a set of M observations. Each observation is obtained by means of a sensor, which records the signal at the point in the space where said sensor is located. The signal thus recorded results from the mixing and propagation in the space E of the signals sᵢ, and is therefore affected by various disturbances specific to the medium traversed, such as noise, reverberation, interference, etc.

The known source separation methods can be classified into two main categories. The first category includes all the methods for which the number N of sources must be known a priori. However, this information is not always available. Even when it is available, these methods must be re-parameterized at first use and whenever the number of sources changes over time. This makes them particularly difficult to use in the context of automated processing. In addition, the accuracy of the result obtained depends on knowing the number of sources. The speed with which the number of sources evolves in the environment, as well as the delay in updating this parameter in this first category of methods, are thus likely to degrade their reliability.

A second category of methods, which includes the method described in the document "A Robust Method to Count and Locate Audio Sources in a Stereophonic Linear Anechoic Mixture", Arberet, S.; Gribonval, R.; Bimbot, F., Acoustics, Speech and Signal Processing, 2007, ICASSP 2007, IEEE International Conference on, vol. 3, pp. III-745-III-748, 15-20 April 2007, proposes to detect, in the M observations, time-frequency zones for which only one source is dominant, that is to say a source for which the energy of the corresponding signal sᵢ is greater than the sum of the energies of the other signals. The delays and amplitude differences observed between the sensors, taken in pairs, are then determined in order to calculate the so-called binaural indices of each source. The directions of arrival of the signals sᵢ are then calculated, and the number of sources is deduced by counting the number of directions of arrival corresponding to an active source. However, the methods of this category can only be used in so-called anechoic environments, in which the observations are not affected by any reverberation. Indeed, these methods are based on the assumption that a binaural index is invariant whatever the frequency of the signals sᵢ. This assumption is valid in an anechoic environment but no longer holds in an environment affected by reverberation. In addition, reverberation makes it difficult to detect time-frequency zones where only one source is dominant. This limitation therefore excludes the use of these methods with observations made in echoic environments, which include, for acoustic signals, meeting rooms, restaurants and more generally all the everyday places in which sound is recorded.

The present invention aims to answer an additional problem. The known methods for determining a direction of arrival of a signal coming from a single source in space are based on a phase difference φ (t, ω) of the signal at the level of several sensors distributed in the space.

To that end, they propose computing a probability p of presence of the source as a function of time and direction of arrival, based for example on the following cost function: where the function P is a probability law, R is the ratio between the signals received by the sensors, and G(ω) is a weighting function that gives more weight to certain frequencies depending on the configuration of the space. However, the phase difference arg[R(t, ω)] is a discontinuous function because it is defined modulo 2π and generally exhibits jumps of 2π as a function of frequency, whereas the function ωτ is linear and continuous. So-called "phase unwrapping" techniques exist to make this phase difference linear, but these phase-correction methods are not effective, especially in the presence of noise or multiple sources.

The invention proposes means, described hereinafter and in particular illustrated in FIG. 3, to solve this additional problem.

The present invention aims to improve the situation.

A first aspect of the present invention provides a method for determining the direction of arrival of a first number of sound signals emitted by sources, over a time range subdivided into frames, in a space, from the knowledge of at least two observations obtained using sensors. The method comprises the following steps:

/a/ for each frame, a first probability of presence of one of the sources is calculated, from the observations, for each direction of a set of directions of the space;

/b/ for a second number of frames of a time window, a second probability of presence of one of the sources in each direction of a set of directions of the space is calculated as a function of the first probability;

/c/ the directions for which there is a local maximum of the second probability are sought, said directions each corresponding to the direction of arrival of one of the signals.

The invention thus proposes a method capable of processing a plurality of sources. The method is functional, even when the number of sources is greater than the number of observations.

This process can be automated. In particular, it is not necessary a priori to indicate to the method the number of sources.

Provision can be made so that, during step /c/, only the directions corresponding to a local maximum greater than a threshold are considered as directions of arrival of one of the signals. This threshold filtering also improves the robustness of the method against false detections, that is, a direction of arrival being reported where no source is actually present.

In the case, for example, where this parameter is unknown, the method may comprise a step /e/ in which the first number of sources is determined by counting the number of local maxima of the second probability. It is thus possible to provide this information at the output of the method by calculation, without prior knowledge of this number.

In one embodiment of the present invention, the first probability is calculated by performing the following steps:

/d/ the phase differences between the observations are determined;

/e/ the first probability is obtained by calculating the level of correlation, in the complex vector domain, of a cost function relating to the ratio between the observations.

In particular, the first probability can be calculated by defining a weighting function of frequency ranges (G (ω)) and using the following mathematical expression:

The first probability thus obtained does not undergo any frequency jump, since the correlation level is calculated in the complex domain.

In one embodiment of the present invention, the second probability is calculated, for each direction of the space, by determining the maximum of the first probability over the time window. Alternatively, the second probability can be calculated by:

• determining, for each time frame, the dominant direction of arrival $\tau(t) = \arg\max_{\tau} p(\tau, t)$;

• calculating a probability of dominance $m(t) = \max_{\tau} p(\tau, t)$;

• applying, to obtain the second probability, and for a value ε defining a degree of smoothing, the following mathematical expression:

$$\mu(\tau) = \sum_{t \in F} m(t)\, \mathbf{1}_{\tau - \varepsilon \le \tau(t) < \tau + \varepsilon}.$$

The second number of frames in the time window can be chosen to be inversely proportional to the speed at which the sources are likely to move in space. This allows in particular to adapt the method according to the invention to the characteristics of the sources.

A second aspect of the present invention provides a device comprising means adapted to implement a method of determining the direction of arrival according to the first aspect of the present invention.

A third aspect of the present invention provides an audio decoding device comprising a direction-of-arrival estimator according to the second aspect of the present invention. By way of example, the decoding device can generate, from stereo streams of the MP3/AAC type without auxiliary information, content of the 5.1 or binaural type by identifying the sources present in the mixture as well as their directions of arrival.

A fourth aspect of the present invention provides a computer program including instructions for carrying out the method according to the first aspect of the present invention when the program is executed by a processor.

Other aspects, objects and advantages of the invention will appear on reading the description of one of its embodiments.

The invention will also be better understood with the aid of the drawings, in which:

FIG. 1 illustrates a space having a plurality of sources;

FIG. 2 illustrates the main steps of a method for determining the direction of arrival of the signals in space from the observations, implemented according to an embodiment of the present invention;

FIG. 3 illustrates the main steps of an embodiment of calculating the first probability p;

FIG. 4 shows, in a schematic diagram, a device for estimating the direction of arrival of signals, according to an embodiment of the present invention;

FIG. 5 illustrates a decoding device, according to an embodiment of the present invention.

In the present description, and as illustrated in FIG. 1, three sources 10a, 10b, 10c, emitting signals s₁, s₂, s₃ respectively, are considered in a space E. The sources can move over time. With respect to a point A of the space E, and at a given instant, the signal s₁ has a direction of arrival θ₁, the signal s₂ a direction of arrival θ₂, and the signal s₃ a direction of arrival θ₃. Over a time range P, subdivided into frames T, observations x₁(t) and x₂(t) have been made at a point O₁ and a point O₂ of the space E, respectively. The present description refers, by way of illustration only, to a number N of sources equal to 3 and a number M of observations equal to 2. It can readily be applied to any other combination of the number N of sources and the number M of observations, M being greater than or equal to 2.

In an echoic space E, the observations x₁(t) and x₂(t) can be modeled by a noisy convolutive mixture of the signals s₁, s₂, s₃. Denoting by {a_ji(k)} the coefficients of the impulse response of the filter between the i-th source and the j-th sensor, by b_j(t) diffuse, independent additive noise sources, and by * the convolution operator, the observations x₁(t) and x₂(t) can be modeled as follows:

$$x_j(t) = \sum_{i=1}^{N} \sum_{k} a_{ji}(k)\, s_i(t-k) + b_j(t) = \sum_{i=1}^{N} (a_{ji} * s_i)(t) + b_j(t), \qquad j = 1, \dots, M.$$
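The convolutive mixture model above can be illustrated with a short Python sketch. It is only a toy simulation under stated assumptions: the helper names `convolve` and `mix`, the layout `filters[j][i]` and the choice of white Gaussian noise for b_j(t) are hypothetical and do not come from the description.

```python
import random

def convolve(a, s):
    """Causal discrete convolution (a * s)(t) = sum_k a[k] * s[t - k]."""
    return [
        sum(a[k] * s[t - k] for k in range(len(a)) if 0 <= t - k)
        for t in range(len(s))
    ]

def mix(sources, filters, noise_std=0.0, seed=0):
    """x_j(t) = sum_i (a_ji * s_i)(t) + b_j(t) for each sensor j.

    filters[j][i] is the impulse response from source i to sensor j.
    The noise b_j is drawn as white Gaussian noise (an assumption).
    """
    rng = random.Random(seed)
    length = len(sources[0])
    observations = []
    for sensor_filters in filters:
        x = [rng.gauss(0.0, noise_std) for _ in range(length)]
        for a, s in zip(sensor_filters, sources):
            for t, value in enumerate(convolve(a, s)):
                x[t] += value
        observations.append(x)
    return observations

# Two sensors, one source: sensor 2 receives the signal one sample later, attenuated.
s1 = [1.0, 0.0, 0.0, 2.0, 0.0]
x1, x2 = mix([s1], [[[1.0]], [[0.0, 0.5]]])
```

With longer impulse responses in `filters`, the same function gives a crude stand-in for reverberation.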

In the frequency domain, by time-frequency transformation, with u(k) an apodizing window such as a Hanning window, we obtain:

$$X_j(t, \omega) = \sum_{k} u(k)\, x_j(t+k)\, e^{-\mathrm{i}\omega k} = \sum_{i=1}^{N} A_{ji}(\omega)\, S_i(t, \omega) + B_j(\omega)$$

where A_ji(ω) is the time-frequency transform of a_ji(k), B_j(ω) the time-frequency transform of b_j(t), and S_i(t, ω) the short-term time-frequency transform of s_i(t).
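The windowed transform of one frame can be sketched as follows. This is a minimal illustration rather than the implementation of the device: the function names `hann` and `stft_frame`, the frame length and the use of a plain discrete Fourier transform (instead of an FFT) are assumptions made for readability.

```python
import cmath
import math

def hann(n):
    """Hann (Hanning) apodizing window u(k) of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]

def stft_frame(x, t, n):
    """Return X(t, w_m) for m = 0..n-1, with w_m = 2*pi*m/n rad/sample.

    Implements X(t, w) = sum_k u(k) x(t + k) exp(-i w k) for one frame
    starting at sample t (plain DFT, O(n^2), for illustration only).
    """
    u = hann(n)
    return [
        sum(u[k] * x[t + k] * cmath.exp(-1j * 2 * math.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]

# Example: a sinusoid centred on bin 4 of a 32-sample frame.
n = 32
x = [math.cos(2 * math.pi * 4 * k / n) for k in range(2 * n)]
spectrum = stft_frame(x, 0, n)
peak_bin = max(range(n // 2), key=lambda m: abs(spectrum[m]))
```

Applying `stft_frame` to both observations at the same frame index gives the pair X₁(t, ω), X₂(t, ω) from which the ratio between the observations is formed.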

The signals s₁, s₂, s₃ have the characteristic that, for each of them, there exists at least one frame t_p of the time range P during which the energy of said signal is greater than the sum of the energies of the other signals. The source at the origin of said signal is then said to be dominant during this frame t_p. The following mathematical expression conveys this characteristic: with card{t_p} ≥ 1. It follows from this characteristic that, for all the frames t_p, the observations X₁(t_p, ω) and X₂(t_p, ω) in the time-frequency domain can be approximated by the following mathematical expression:

$$X_j(t_p, \omega) \approx A_{ji}(\omega)\, S_i(t_p, \omega) + B'_j(\omega)$$

where B'_j is the sum of B_j and the residues of the other, non-dominant sources.

FIG. 2 illustrates the main steps of a method for determining the direction of arrival of the signals in space E from observations, implemented according to one embodiment of the present invention.

In a first step 50, from the M observations, one calculates for all the frames T, a first probability p of presence of one of the sources for each direction of arrival. Equivalently, the probability p can be calculated for a direction expressed with a vector or for a direction expressed with an angle. Similarly, it is possible to limit, according to the space E, the interval for which we will seek to obtain the first probability p. For example, it may be desirable to limit calculations to directions within a given cone of space. In addition, it is possible to define a spatial resolution step with which we will scan the space E, to limit the number of first probabilities p actually calculated.

In a second step 60, from the first probabilities p, a second probability μ of presence of one of the sources is calculated according to the direction of arrival, for a subset composed of a number g of T frames forming a time window F. The number g can be between 2 and the number of frames T included in the time range P. In the case where the sources are likely to move in the space E, the choice of the number g of frames T of the window F is a function of the speed of movement of the sources. In general, the higher the speed of the sources, the smaller the number g will be. Consequently, the second number g of frames T of the time window F can be chosen so as to be inversely proportional to the speed at which the sources are likely to move in the space E. If the sources move at different speeds, we can for example consider either the minimum speed, or the maximum speed or the average speed.

In a third step 70, the directions for which there is a local maximum of the second probability μ are sought, said directions each corresponding to the direction of arrival of one of the signals. By way of nonlimiting example, the local maxima can be obtained by first smoothing the second probability μ using, for example, a first-order low-pass filter, and then looking for the directions for which the first derivative of the second probability μ is zero and the second derivative of the second probability μ is negative. Optionally, the energy of the second probability μ can then be calculated over a range of values around each local maximum, and the local maxima for which this energy is lower than a predetermined threshold can be eliminated.
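The search for local maxima can be sketched on a sampled second probability. The sketch below makes several assumptions that are not in the description: the names `smooth` and `find_directions`, the forward-backward application of the first-order filter (used here so that the peaks are not shifted), the discrete comparison standing in for the derivative test, and the values of `alpha`, `half_width` and `threshold`.

```python
def smooth(mu, alpha=0.5):
    """First-order low-pass applied forward then backward (no peak shift)."""
    def one_pass(v):
        out, state = [], v[0]
        for value in v:
            state = alpha * state + (1.0 - alpha) * value
            out.append(state)
        return out
    return one_pass(one_pass(mu)[::-1])[::-1]

def find_directions(mu, threshold, half_width=1, alpha=0.5):
    """Indices of local maxima of mu whose local energy reaches threshold."""
    s = smooth(mu, alpha)
    peaks = []
    for k in range(1, len(s) - 1):
        # Discrete analogue of "first derivative zero, second derivative negative".
        if s[k - 1] < s[k] >= s[k + 1]:
            lo, hi = max(0, k - half_width), min(len(s), k + half_width + 1)
            energy = sum(v * v for v in s[lo:hi])
            if energy >= threshold:
                peaks.append(k)
    return peaks

# Second probability sampled on 11 directions: two clear peaks and one small bump.
mu = [0.0, 0.1, 0.9, 0.1, 0.0, 0.05, 0.0, 0.2, 1.0, 0.2, 0.0]
directions = find_directions(mu, threshold=0.1)
source_count = len(directions)  # counting the maxima gives the number of sources
```

In this example the small bump at index 5 disappears after smoothing, so only the two main peaks are kept as directions of arrival.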

In one embodiment of the present invention, the method includes an optional step 80 during which the number N of sources is determined by counting the number of local maxima of the second probability μ.

FIG. 3 illustrates the main steps of an embodiment of the calculation of the first probability p, implemented according to one embodiment of the present invention.

To calculate the first probability p in step 50, for a given frame T, the phase difference φ(t, ω) between the corresponding observations X₁ and X₂ is determined in a step 52. The phase difference φ(t, ω) is assumed to be linear, which is verified in practice, even in an echoic space E. Thus, at the moments when a source is dominant relative to the others, the phase difference φ(t_p, ω) between the points O₁ and O₂ can be modeled by the following mathematical expression:

$$\varphi(t_p, \omega) = \arg R(t_p, \omega) = \tau_p\, \omega + h(t_p, \omega)$$

where R is the ratio between the observations X₁ and X₂, and h is composed of the residues of the non-dominant sources, the diffuse additive noises and the reverberation. The phase difference φ(t_p, ω) is therefore linear as a function of the frequency ω, as long as h(t_p, ω) does not degrade the linearity of φ(t_p, ω).

For a given source emitting a signal sⱼ whose direction of arrival is denoted θⱼ, said signal arrives at one of the two points O₁ and O₂ with a time offset τ with respect to the other point. The time offset τ can be estimated from the distance between the points O₁ and O₂ and from c, the celerity of the signal sⱼ in the space E. It follows that the determination of the phase difference φ(t, ω) makes it possible to obtain the direction of arrival θⱼ for each source.

In a step 54, the first probability p is obtained by calculating the correlation level, in the complex vector domain, of a cost function relating to the ratio R between the observations X₁ and X₂. The probability p can be obtained by applying the following mathematical formula:

where the function p is a probability law, for example a Poisson or Gaussian law, and G(ω) is a weighting function that gives more weight to certain frequencies depending on the configuration of the space.
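As an illustration of step 54, the sketch below scores candidate delays by correlating unit phasors in the complex domain. It is one plausible form chosen for illustration and is not claimed to be the expression used by the invention: the function name `first_probability`, the normalization by the sum of the weights and the uniform default weighting are all assumptions.

```python
import cmath
import math

def first_probability(ratio, omegas, taus, weights=None):
    """Score each candidate delay tau for one frame.

    ratio[m]  : complex ratio R(t, w_m) between the two observations
    omegas[m] : angular frequency w_m
    weights[m]: weighting G(w_m), uniform if omitted (assumption)
    Returns one value in [0, 1] per tau. Working with unit phasors means
    the 2*pi jumps of arg R never have to be unwrapped.
    """
    if weights is None:
        weights = [1.0] * len(omegas)
    total = sum(weights)
    scores = []
    for tau in taus:
        acc = 0j
        for r, w, g in zip(ratio, omegas, weights):
            if r != 0:
                acc += g * (r / abs(r)) * cmath.exp(-1j * w * tau)
        scores.append(abs(acc) / total)
    return scores

# Synthetic frame: a single dominant source with a delay of 3 samples,
# so that arg R(t, w) = 3 * w up to multiples of 2*pi.
omegas = [2 * math.pi * m / 64 for m in range(1, 32)]
ratio = [cmath.exp(1j * w * 3.0) for w in omegas]
taus = [float(d) for d in range(-5, 6)]
p = first_probability(ratio, omegas, taus)
best_tau = taus[p.index(max(p))]
```

Because the phase enters only through a complex exponential, a phase of 3ω and its wrapped value give the same score, which is the property the complex-domain calculation is meant to provide.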

In a first embodiment, the second probability μ, for a given direction, calculated during the second step 60, can be obtained by identifying the maximum of the first probability p over the time window F, which can be expressed as follows:

$$\mu(\tau) = \max_{t \in F} p(\tau, t).$$

In a second embodiment, the second probability μ, for a given direction, calculated during the second step 60, can be obtained by determining, for each time frame, the dominant direction of arrival, equal to the position of the maximum over the directions studied, $\tau(t) = \arg\max_{\tau} p(\tau, t)$, and by calculating the probability of this dominance, equal to the value of this maximum, $m(t) = \max_{\tau} p(\tau, t)$.

The second probability is then given by the weighted histogram computed from the set of dominant directions of arrival and their probabilities of dominance:

$$\mu(\tau) = \sum_{t \in F} m(t)\, \mathbf{1}_{\tau - \varepsilon \le \tau(t) < \tau + \varepsilon}$$

with ε a value defining the degree of smoothing of the histogram.
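Both embodiments of the second probability can be written in a few lines. The sketch below assumes that the first probability is stored as a list of frames, each frame being a list indexed by direction; the function names and the toy values are hypothetical.

```python
def second_probability_max(p_frames):
    """mu(tau) = max over frames t in F of p(tau, t).

    p_frames[t][d] is the first probability for frame t and direction index d.
    """
    return [max(column) for column in zip(*p_frames)]

def second_probability_histogram(p_frames, taus, eps):
    """Weighted histogram of dominant directions.

    For each frame, the dominant direction tau(t) = argmax_tau p(tau, t)
    contributes its dominance m(t) = max_tau p(tau, t) to every bin tau with
    tau - eps <= tau(t) < tau + eps.
    """
    mu = [0.0] * len(taus)
    for p in p_frames:
        m = max(p)
        tau_dominant = taus[p.index(m)]
        for d, tau in enumerate(taus):
            if tau - eps <= tau_dominant < tau + eps:
                mu[d] += m
    return mu

taus = [-1.0, 0.0, 1.0]
window = [
    [0.1, 0.2, 0.9],  # frame 1: the source at tau = 1 dominates
    [0.8, 0.1, 0.3],  # frame 2: the source at tau = -1 dominates
    [0.2, 0.1, 0.7],  # frame 3: the source at tau = 1 dominates again
]
mu_max = second_probability_max(window)
mu_hist = second_probability_histogram(window, taus, eps=0.5)
```

The histogram variant accumulates evidence across frames, so a direction that dominates in several frames of the window F ends up with a larger value than in the maximum-based variant.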

FIG. 4 shows, in a schematic diagram, a device 100 for estimating the direction of arrival of signals, according to an embodiment of the present invention. The device 100 is particularly suitable for implementing the method according to the invention.

It comprises an input 105 enabling it to receive the observations x₁, x₂ of the signals s₁, s₂, s₃ over the time range P. A time-frequency transformation unit 106, for example a unit adapted to implement a Fast Fourier Transform, commonly known as an "FFT", then makes it possible to work on the observations in the frequency domain, the observations denoted x₁, x₂ in the time domain being conventionally denoted X₁, X₂ in the frequency domain.

It comprises a unit 110 for calculating the direction of arrival. The latter is connected to the time-frequency transformation unit 106. It is adapted to calculate the first probability p of presence of one of the sources for each direction of the space E from the observations X₁, X₂.

The device 100 comprises a temporal grouping unit 125, cooperating with the unit 110 for calculating the direction of arrival. This grouping unit 125 is adapted to calculate the second probability μ of presence of one of the sources for each direction of the space, as a function of the first probability p, and over the time window F.

The device 100 comprises an identification unit 130, cooperating with the temporal grouping unit 125, adapted to identify the directions for which there is a local maximum of the second probability μ. The identification unit 130 is connected to the output 140 of the device 100 so as to be able to deliver the identified directions, which correspond to the directions of arrival θⱼ of the signals sⱼ.

The device may also include counting means 135 for outputting on the output 140 the first number N of sources by counting the number of local maxima of the second probability μ.

The device may also comprise parameterization means 120 adapted to modify, at the level of the temporal grouping unit 125, the second number g of frames T in inverse proportion to the speed at which the sources are likely to move in the space E.

FIG. 5 illustrates, by a block diagram, an audio decoding device, according to an embodiment of the present invention.

Such a device is for example designed to notably create 5.1 type streams from a stereo stream without auxiliary information.

The decoding device 210 receives, as input, observations x₁, ..., x_N, typically a stereo signal derived from an AAC coder, for example, and containing signals sᵢ emitted by a plurality of sources. The decoding device comprises a device 100 for estimating the direction of arrival of signals according to the invention, which also receives the observations x₁, ..., x_N. Using the information provided by the device 100, the audio decoding device comprises the processing means 215 needed to generate multiple spatialized streams on an output 220 from the directions of arrival of the signals and, where applicable, the number of sources.

Claims

1. Method of determining the direction of arrival (θᵢ) of a first number (N) of sound signals (sᵢ) emitted by sources (10), over a time range (P) subdivided into frames (T), in a space (E), from the knowledge of at least two observations (x₁, x₂) obtained using sensors, characterized in that it comprises the following steps:

/a/ for each frame (T), calculating (50), from the observations (x₁, x₂), for each direction of a set of directions of the space (E), a first probability (p) of presence of one of the sources;

/b/ for a second number (g) of frames (T) of a time window (F), calculating (60) a second probability (μ) of presence of one of the sources (10) in each direction of a set of directions of the space (E) as a function of the first probability (p);

/c/ searching (70) for the directions for which there is a local maximum of the second probability (μ), said directions each corresponding to the direction of arrival (θᵢ) of one of the signals (sᵢ).

2. Method according to claim 1, wherein, during step /c/, only the directions corresponding to a local maximum greater than a threshold are considered as directions of arrival (θᵢ) of one of the signals (sᵢ).

3. Method according to any one of the preceding claims, further comprising a step /e/ in which the first number (N) of sources (10) is determined by counting the number of local maxima of the second probability (μ).

4. Method according to any one of the preceding claims, wherein the first probability (p) is calculated (50) by performing the following steps:

/d/ determining (52) the phase differences (φ(t, ω)) between the observations (x₁, x₂);

/e/ obtaining (54) the first probability (p) by calculating a cost function (p) based on the correlation level, in the complex vector domain, between the ratio (R) between the observations (x₁, x₂) and a theoretical model of this ratio.

5. Method according to claim 4, wherein the first probability (p) is calculated by defining a frequency range weighting function (G(ω)) and using the following mathematical expression:

6. Method according to any one of the preceding claims, wherein the second probability (μ) is calculated (60), for each direction of the space (E), by determining the maximum of the first probability (p) over the time window (F).

7. Method according to any one of claims 1 to 5, wherein the second probability (μ) is calculated (60) by:

• determining, for each time frame, the dominant direction of arrival $\tau(t) = \arg\max_{\tau} p(\tau, t)$;

• calculating a probability of dominance $m(t) = \max_{\tau} p(\tau, t)$;

$$\mu(\tau) = \sum_{t \in F} m(t)\, \mathbf{1}_{\tau - \varepsilon \le \tau(t) < \tau + \varepsilon}.$$

8. Device (100) for estimating the direction of arrival (θᵢ), during a time range (P) subdivided into frames (T), in a space (E), of a first number (N) of signals (sᵢ) emitted respectively by the same first number (N) of sources (10), characterized in that it comprises:

an input (105) for receiving at least two observations (x₁, x₂) of said signals (sᵢ) over the time range (P);

a unit (110) for calculating the direction of arrival, able to receive the observations (x₁, x₂), said unit (110) being adapted to calculate a first probability (p) of presence of one of the sources for each direction of a set of directions of the space (E);

a temporal grouping unit (125), cooperating with the unit (110) for calculating the direction of arrival, adapted to calculate a second probability (μ) of presence of one of the sources (10) for each direction of a set of directions of the space (E), as a function of the first probability (p), and over a time window (F) composed of a second number (g) of frames (T);

an identification unit (130), cooperating with the temporal grouping unit (125), adapted to identify the directions for which there is a local maximum of the second probability (μ), said directions each corresponding to the direction of arrival (θᵢ) of one of the signals (sᵢ) and being delivered on an output (140).

9. Device according to claim 8, further comprising counting means (135) for delivering the first number (N) of sources (10) on the output (140) by counting the number of local maxima of the second probability (μ).

10. Device according to any one of claims 8 to 9, comprising parameterization means (120) adapted to modify, at the level of the temporal grouping unit (125), the second number (g) of frames (T) in inverse proportion to the speed at which the sources (10) are likely to move in the space (E).

11. Audio decoding device comprising an input for receiving at least two observations (x₁, x₂) of signals (sᵢ) emitted by sources (10), said device comprising a device for estimating the direction of arrival (θᵢ) according to any one of claims 8 to 10.

12. Computer program comprising instructions for implementing the method according to any one of claims 1 to 7 when the program is executed by a processor.