EP2901447B1

EP2901447B1 - Method and device for separating signals by minimum variance spatial filtering under linear constraint

Info

Publication number: EP2901447B1
Application number: EP13770877.2A
Authority: EP
Inventors: Sylvain Marchand; Stanislaw GORLOW
Original assignee: Centre National de la Recherche Scientifique CNRS; Universite des Sciences et Tech (Bordeaux 1)
Current assignee: Centre National de la Recherche Scientifique CNRS; Universite des Sciences et Tech (Bordeaux 1)
Priority date: 2012-09-27
Filing date: 2013-09-25
Publication date: 2016-12-21
Anticipated expiration: 2033-09-25
Also published as: FR2996043A1; FR2996043B1; JP2015530619A; US9437199B2; JP6129321B2; WO2014048970A1; US20150243290A1; EP2901447A1

Description

The present invention relates to a method for separating some of the source signals that make up an overall digital audio signal. The invention also relates to a device for implementing this method.

Signal mixing consists of summing several signals, called source signals, to obtain one or more composite signals, called mixed signals. In audio applications in particular, mixing may consist of a simple step of adding the source signals, or it may also include steps of filtering the signals before and/or after the addition. Furthermore, for some applications such as the audio compact disc, the source signals may be mixed differently to form two mixed signals corresponding to the two channels (left and right) of a stereo signal.
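The simplest case described above, summation without filtering, can be sketched as a linear instantaneous stereo mix in which each source is scaled by a left gain and a right gain before summation. The constant-power panning law and the function names below are illustrative assumptions, not taken from the patent.

```python
import math

def pan_gains(theta):
    """Constant-power panning gains; theta = 0 is hard left, pi/2 hard right (assumed convention)."""
    return math.cos(theta), math.sin(theta)

def mix_stereo(sources, angles):
    """Sum panned copies of equal-length mono sources into a (left, right) pair."""
    n = len(sources[0])
    left, right = [0.0] * n, [0.0] * n
    for s, theta in zip(sources, angles):
        g_l, g_r = pan_gains(theta)
        for i, v in enumerate(s):
            left[i] += g_l * v
            right[i] += g_r * v
    return left, right

# Usage: one source panned hard left, one panned to the center.
left, right = mix_stereo([[1.0, 2.0], [3.0, 4.0]], [0.0, math.pi / 4])
```

Because cos^2 + sin^2 = 1, the panned source keeps its total power across the two channels, which is why this law is a common default.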

Source separation consists of estimating source signals from the observation of a certain number of different mixed signals formed from these same source signals. The objective is generally to enhance one or more target source signals, or, if possible, to extract them completely. Source separation is particularly difficult in so-called "under-determined" cases, in which fewer mixed signals are available than there are source signals present in them. Extraction is then very difficult or even impossible, because the mixed signals carry little information compared with the source signals. Music signals on audio compact discs are a particularly representative example: only two stereo channels (i.e., two mixed signals, left and right), which are generally highly redundant, are available for a potentially large number of source signals.
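A short numerical illustration of why the under-determined case is hard, assuming a linear instantaneous mix observed at a single sample: a 2 x 3 mixing matrix has a non-trivial null space, so two different source vectors can produce exactly the same pair of mixed values. The gains below are arbitrary illustrative values.

```python
def cross(u, v):
    """Cross product of two 3-vectors; the result is orthogonal to both."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def mix(A, s):
    """Mixed values x = A s for one sample."""
    return [sum(a * x for a, x in zip(row, s)) for row in A]

A = [[1.0, 0.7, 0.2],   # left-channel gains of sources 1..3 (illustrative)
     [0.0, 0.7, 0.9]]   # right-channel gains
s = [1.0, -2.0, 0.5]
null = cross(A[0], A[1])                       # orthogonal to both rows, so A @ null == 0
s_alt = [x + 3.0 * n for x, n in zip(s, null)]
print(mix(A, s), mix(A, s_alt))                # same mixture, different sources
```

Without extra information (priors, sparsity, or transmitted side information), nothing in the two mixed values distinguishes `s` from `s_alt`.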

There are several types of approaches to source separation, among them blind separation, computational auditory scene analysis, and model-based separation. Blind separation is the most general form, in which nothing is known a priori about either the source signals or the nature of the mixed signals. A number of assumptions are then made about the source signals and the mixed signals (for example, that the source signals are statistically independent), and the parameters of a separation system are estimated by maximizing a criterion based on these assumptions (for example, by maximizing the independence of the signals produced by the separation device). However, this method is generally used when many mixed signals are available (at least as many as there are source signals), and it is therefore not applicable to under-determined cases, in which there are fewer mixed signals than source signals.

Computational auditory scene analysis generally consists of modeling the source signals as partials, but the mixed signal is not explicitly decomposed. This method relies on the mechanisms of the human auditory system to separate the source signals in the same way that our ears do. Examples include D.P.W. Ellis, Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis, and its application to speech/nonspeech mixtures (Speech Communication, 27(3), pp. 281-298, 1999), D. Godsmark and G.J. Brown, A blackboard architecture for computational auditory scene analysis (Speech Communication, 27(3), pp. 351-366, 1999), and T. Kinoshita, S. Sakai, and H. Tanaka, Musical sound source identification based on frequency component adaptation (In Proc. IJCAI Workshop on CASA, pp. 18-24, 1999). However, computational auditory scene analysis currently yields separated source signals of insufficient quality.

Another form of separation relies on decomposing the mixture onto a basis of suitable functions. There are two main categories: sparse decomposition in the time domain and sparse decomposition in the frequency domain.

In the first case, the waveform of the mixture is decomposed; in the second, its spectral representation is decomposed. In both cases the decomposition is a sum of elementary functions, called "atoms", taken from a dictionary. Various algorithms make it possible to choose the type of dictionary and the most likely corresponding decomposition. For the time domain, examples include L. Benaroya, Représentations parcimonieuses pour la séparation de sources avec un seul capteur [Sparse representations for single-sensor source separation] (Proc. GRETSI, 2001), and P.J. Wolfe and S.J. Godsill, A Gabor regression scheme for audio signal analysis (Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 103-106, 2003). In the method proposed by Gribonval (R. Gribonval and E. Bacry, Harmonic Decomposition of Audio Signals With Matching Pursuit, IEEE Trans. Signal Proc., 51(1), pp. 101-112, 2003), the decomposition atoms are classified into independent subspaces, which makes it possible to extract groups of harmonic partials. One limitation of this method is that generic dictionaries of atoms that are not adapted to the signals, such as Gabor atoms, do not give good results. Moreover, for these decompositions to be effective, the dictionary must contain every time-shifted version of the waveforms of each type of instrument. The decomposition dictionaries must therefore be extremely large for the projection, and thus the separation, to be effective.
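For readers unfamiliar with matching pursuit, the greedy decomposition named in the Gribonval and Bacry reference can be sketched generically: at each step, pick the dictionary atom most correlated with the residual and subtract its projection. This sketch assumes unit-norm atoms and a tiny toy dictionary; it does not implement the harmonic-subspace grouping of the cited method.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, n_iter):
    """Greedy decomposition of `signal` onto unit-norm `atoms`; returns (coefficients, residual)."""
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        scores = [dot(residual, a) for a in atoms]          # correlation with each atom
        k = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        c = scores[k]
        if c == 0.0:                                          # nothing left to explain
            break
        coeffs[k] += c
        residual = [r - c * x for r, x in zip(residual, atoms[k])]
    return coeffs, residual

# Usage: an overcomplete toy dictionary (4 unit vectors plus one diagonal atom).
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
diag = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0, 0.0]
coeffs, res = matching_pursuit([2.0, 0.0, 1.0, 0.0], basis + [diag], 2)
```

The size problem mentioned above is visible here: every shape the signal can take, at every time shift, must be present as an atom for the residual to vanish.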

To overcome the translation-invariance problem that arises in the time-domain case, there are sparse decomposition approaches in the frequency domain. Notable among them are M.A. Casey and A. Westner (Separation of mixed audio sources by independent subspace analysis, Proc. Int. Computer Music Conf., 2000), who introduced independent subspace analysis (ISA). This analysis decomposes the short-term magnitude spectrum of the mixed signal (computed with the short-time Fourier transform, STFT) onto a basis of atoms, then groups the atoms into independent subspaces, each specific to one source, and finally resynthesizes the sources separately. However, this approach is generally limited by several factors: the resolution of the STFT spectral analysis, the overlap of the sources in this spectral domain, and the restriction of spectral separation to the magnitude (the phase of the resynthesized signals being that of the mixed signal). It is thus generally difficult to represent the mixed signal as a sum of independent subspaces, both because of the complexity of the sound scene in the spectral domain (strong interleaving of the different components) and because the contribution of each component to the mixed signal evolves over time. In practice, these methods are often evaluated on well-controlled, "simplified" mixed signals (the source signals are MIDI instruments, or a small number of relatively easily separable instruments).

Another source separation method is "informed" source separation: information about one or more source signals is transmitted to the decoder along with the mixed signal. Using algorithms and this information, the decoder can then separate, at least partially, at least one source signal from the mixed signal. An example of informed source separation is described by M. Parvaix and L. Girin (Informed source separation of linear instantaneous under-determined audio mixtures by source index embedding, IEEE Trans. Audio Speech Lang. Process., vol. 19, pp. 1721-1733, August 2011). The information transmitted to the decoder indicates, in particular, the two predominant source signals in the mixed signal for different frequency regions. However, such a method is not always suitable when more than two source signals contribute simultaneously to the same frequency region of the mixed signal: in that case, at least one source signal is neglected, creating a "spectral hole" in the reconstruction of that source signal.
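The "spectral hole" can be illustrated with a sketch of the stereo case, assuming (as we understand the cited work) that the decoder knows the indices of the two predominant sources in a time-frequency bin and inverts the corresponding 2 x 2 mixing sub-matrix locally. Function names and gains are hypothetical; any third active source is simply set to zero.

```python
def solve2(m, x):
    """Solve the 2 x 2 linear system m @ s = x by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return (d * x[0] - b * x[1]) / det, (a * x[1] - c * x[0]) / det

def estimate_bin(A, x, predominant):
    """Estimate every source at one TF bin of a stereo mix x = A s, given the
    indices (i, j) of the two predominant sources; all other sources are set to zero."""
    i, j = predominant
    sub = [[A[0][i], A[0][j]], [A[1][i], A[1][j]]]
    est = [0.0] * len(A[0])
    est[i], est[j] = solve2(sub, x)
    return est

A = [[1.0, 0.7, 0.2],   # illustrative left gains
     [0.0, 0.7, 0.9]]   # illustrative right gains
x = [1.41, 0.62]        # A @ [1.0, 0.5, 0.3]: three sources active in this bin
est = estimate_bin(A, x, (0, 1))
```

Here the third source (true value 0.3) is lost entirely, and its contribution is misattributed to the two retained sources (0.79 and about 0.886 instead of 1.0 and 0.5).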

It is also known, particularly in telecommunications, to filter signals picked up by a plurality of sensors according to the position in space of those signals relative to the sensors. This is spatial filtering (or "beamforming"), which favors the signal arriving from a given direction in space and attenuates signals arriving from other directions. One example of such filters is the linearly constrained minimum variance (LCMV) spatial filter. An example of such a filter is disclosed in particular in document EP 1 633 121.
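For reference, the textbook LCMV beamformer minimizes the output power subject to linear constraints on its response. In the standard notation below, which is an assumption since the text gives no formula, R_x is the covariance matrix of the sensor (channel) signals, C the constraint matrix, f the desired response vector, and the filter output is y = w^H x:

```latex
\mathbf{w}_{\mathrm{LCMV}}
  = \arg\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}_{x}\mathbf{w}
  \quad \text{subject to} \quad \mathbf{C}^{H}\mathbf{w} = \mathbf{f}
\qquad\Longrightarrow\qquad
\mathbf{w}_{\mathrm{LCMV}}
  = \mathbf{R}_{x}^{-1}\mathbf{C}\left(\mathbf{C}^{H}\mathbf{R}_{x}^{-1}\mathbf{C}\right)^{-1}\mathbf{f}

% Single constraint (C = a, f = g):
\mathbf{w} = \frac{g\,\mathbf{R}_{x}^{-1}\mathbf{a}}{\mathbf{a}^{H}\mathbf{R}_{x}^{-1}\mathbf{a}},
\qquad
\mathbf{w}^{H}\mathbf{R}_{x}\mathbf{w} = \frac{|g|^{2}}{\mathbf{a}^{H}\mathbf{R}_{x}^{-1}\mathbf{a}}
```

The closed form follows from setting the gradient of the Lagrangian to zero (R_x w = C lambda) and solving the constraint for lambda.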

An object of the present invention is therefore to provide a method for separating source signals contained in one or more mixed signals more effectively.

To this end, in one embodiment, a method is provided for separating, at least partially, one or more particular digital audio source signals contained in a multichannel (i.e., at least two-channel) digital audio mixed signal, for example a stereo signal. The mixed signal is obtained by mixing several digital audio source signals and includes values representative of the particular source signal or signals. According to the method:

the amplitude modulus or the normalized power of the particular source signal or signals is determined from the values representative of said particular source signal or signals contained in the mixed signal; then
linearly constrained minimum variance spatial filtering is applied to the mixed signal to obtain, at least partially, each particular source signal, said filtering being based on the distribution of said particular source signal between at least two channels of the mixed signal, and the amplitude modulus or the normalized power of said particular source signal being used as the linear constraint of the filter.
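One plausible reading of the filtering step, sketched for a single time-frequency bin of a stereo mix: the single-constraint LCMV filter a^H w = g, where `a` holds the source's left/right gains and the gain g is chosen so that the expected output power w^H R w equals the source's normalized power `phi` taken from the representative values. This choice of g is our assumption about how the magnitude enters the constraint; the text gives no formula, and all names are hypothetical.

```python
import math

def inv2(R):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = R
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def hdot(u, v):
    """Hermitian inner product u^H v of two length-2 vectors (real or complex)."""
    return u[0].conjugate() * v[0] + u[1].conjugate() * v[1]

def lcmv_weights(R, a, phi):
    """Minimize w^H R w subject to a^H w = g, with g > 0 chosen so that the
    expected output power w^H R w equals phi (assumed form of the constraint)."""
    Ria = matvec(inv2(R), a)
    q = hdot(a, Ria).real          # a^H R^{-1} a, > 0 for Hermitian positive-definite R
    g = math.sqrt(phi * q)         # then w^H R w = g**2 / q = phi
    return [g * v / q for v in Ria]

def separate_bin(w, x):
    """Filter output y = w^H x for one stereo STFT coefficient pair x."""
    return hdot(w, x)
```

Running one such filter per source, each with its own `a` and `phi` but sharing the mixture covariance `R`, corresponds to using as many spatial filters as there are sources to separate. The minimum-variance objective drives the response toward other sources' directions close to zero, while the constraint sets the output level.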

The representative values may be the temporal, spectral, or spectro-temporal distribution of the particular source signal, or the temporal, spectral, or spectro-temporal contribution of the particular source signal to the mixed signal. The representative values of the source signals may thus be expressed as amplitude modulus or as normalized power (i.e., energy, which is the square of the amplitude modulus): the representative values may therefore be amplitude-modulus values or normalized-power (energy) values.

The representative values may, for example, be the temporal, spectral, or spectro-temporal distribution of the particular source signal, or its temporal, spectral, or spectro-temporal contribution to the mixed signal, for several regions (or points) of a time-frequency plane. In this case, the amplitude modulus or normalized power of the particular source signal or signals can be determined in the time-frequency plane: the amplitude moduli and normalized powers are then spectro-temporal values.

A transformation or representation in the time-frequency plane consists of representing the source signal, in energy (or normalized power) or in amplitude modulus (i.e., the square root of the energy), as a function of two parameters: time and frequency. This describes how the frequency content of the source signal, in energy or in modulus, evolves over time. For a given instant and a given frequency, this yields a non-negative real value corresponding to the signal components at that frequency and instant. Theoretical formulations and practical implementations of time-frequency representations have already been described (L. Cohen, Time-Frequency Distributions - A Review, Proceedings of the IEEE, vol. 77, no. 7, 1989; F. Hlawatsch, F. Auger, Temps-fréquence : concepts et outils, Hermès Science, Lavoisier, 2005; P. Flandrin, Temps-fréquence, Hermès Science, 1998).
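A common choice of time-frequency representation is the magnitude of the short-time Fourier transform, with the power (energy) representation being its square. The sketch below assumes a periodic Hann window, a 16-sample frame and an 8-sample hop purely for illustration, and uses a direct DFT to stay dependency-free (a practical implementation would use an FFT).

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft_magnitude(x, n_fft=16, hop=8):
    """|X(t, f)| for each full frame t and bins f = 0 .. n_fft // 2 (direct DFT)."""
    win = hann(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + i] * win[i] for i in range(n_fft)]
        frames.append([abs(sum(seg[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                               for i in range(n_fft)))
                       for k in range(n_fft // 2 + 1)])
    return frames

def stft_power(x, n_fft=16, hop=8):
    """Normalized-power (energy) representation: the squared magnitude."""
    return [[m * m for m in row] for row in stft_magnitude(x, n_fft, hop)]

# Usage: a sinusoid exactly on bin 4 peaks at f = 4 in every frame.
tone = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
mag = stft_magnitude(tone)
```

Each entry of `mag` is one of the non-negative real values described above, indexed by frame (time) and bin (frequency).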

Thus, with the method described, the particular source signals can be separated effectively, using spatial filtering improved by the information contained in the mixed signal, without making any assumptions about these signals (apart from the classical statistical assumptions, i.e., that the source signals are independent, zero-mean, and Gaussian-distributed). In particular, the method relies on the distribution of each source signal between the different channels of the mixed signal to isolate the source signals (spatial filtering). Using a linearly constrained minimum variance filter achieves high-performance spatial separation by taking the amplitude modulus or the normalized power of the source signal as the constraint. A particular source signal can thus be spatially decorrelated from the mixed signal while, at the same time, the amplitude of the separated signal is adjusted to the desired level. The spatial filtering step is therefore improved by taking into account the known representative value of the particular source signal.

In particular, the different particular source signals present in the mixed signal can be isolated simultaneously, for example by using as many spatial filters as there are source signals to separate.

Preferably, the filtering is also based on the amplitude modulus or the normalized power of the particular source signals. More specifically, the spatial filtering step may comprise modeling a spatial correlation matrix using the amplitude modulus or the normalized power of the particular source signals and the distribution of each particular source signal between at least two channels of the mixed signal.
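Under the independence and zero-mean assumptions stated above, and assuming instantaneous panning with unit-norm gain vectors a_j, one common way to model the spatial correlation matrix at a time-frequency point is R = sum_j phi_j a_j a_j^T, where phi_j is the normalized power of source j. This is a sketch under those assumptions; the optional diagonal `loading` term is a hypothetical addition for numerical stability, not something the text specifies.

```python
import math

def panning_vector(theta):
    """Unit-norm (left, right) gain vector for panning angle theta (assumed convention)."""
    return [math.cos(theta), math.sin(theta)]

def spatial_covariance(powers, angles, loading=0.0):
    """Model R = sum_j phi_j a_j a_j^T (+ loading * I) for a stereo mix at one TF point."""
    R = [[0.0, 0.0], [0.0, 0.0]]
    for phi, theta in zip(powers, angles):
        a = panning_vector(theta)
        for i in range(2):
            for k in range(2):
                R[i][k] += phi * a[i] * a[k]
    for i in range(2):
        R[i][i] += loading
    return R

# Usage: a hard-left source of power 1 and a hard-right source of power 4.
R = spatial_covariance([1.0, 4.0], [0.0, math.pi / 2])
```

With unit-norm gain vectors, the trace of R equals the total source power at that point, a convenient sanity check.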

Preferably, the mixed signal includes values representative of the particular source signal or signals for at least two channels of the mixed signal, and, before the spatial filtering is performed, the distribution of each particular source signal between said at least two channels of the mixed signal is determined from the mixed signal and said representative values of the particular source signals.
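The text does not say how this distribution is computed. One simple possibility, sketched here purely as an assumption: if the sources are independent, the power of each channel c at every time-frequency point is approximately sum_j g_jc^2 phi_j, so the squared gains can be fitted by least squares over all points and converted into a panning angle. Function names are hypothetical, and the time-frequency points are flattened into plain lists.

```python
import math

def solve(M, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (A[i][n] - sum(A[i][k] * x[k] for k in range(i + 1, n))) / A[i][i]
    return x

def squared_gains(channel_power, source_powers):
    """Least-squares fit of |X_c|^2 ~ sum_j u_j * phi_j over all TF points (u_j = g_jc^2)."""
    J, T = len(source_powers), len(channel_power)
    gram = [[sum(source_powers[i][t] * source_powers[j][t] for t in range(T))
             for j in range(J)] for i in range(J)]
    rhs = [sum(source_powers[i][t] * channel_power[t] for t in range(T)) for i in range(J)]
    return [max(u, 0.0) for u in solve(gram, rhs)]   # squared gains cannot be negative

def panning_angles(left_power, right_power, source_powers):
    """Per-source angle theta_j with (g_jL, g_jR) proportional to (cos theta_j, sin theta_j)."""
    u_left = squared_gains(left_power, source_powers)
    u_right = squared_gains(right_power, source_powers)
    return [math.atan2(math.sqrt(r), math.sqrt(l)) for l, r in zip(u_left, u_right)]
```

The fit needs the source power profiles to be linearly independent across time-frequency points; with real mixtures the per-point relation only holds on average, so the estimate is approximate.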

Alternatively, the distribution of the particular source signal or signals between at least two channels of the mixed signal may be received as input, for example together with the mixed signal.

In other words, the distribution of the particular source signals between the different channels of the mixed signal may either be provided when the separation method is carried out, for example together with the representative values of said particular source signals, or be determined during the separation method from the multichannel mixed signal and the representative values of the particular source signals.

According to one embodiment, determining the amplitude modulus or the normalized power of the particular source signal or signals comprises extracting the representative values of the particular source signal or signals that have been inserted into the mixed signal, for example by watermarking. Extracting the representative values follows from their transmission, which may take place with the mixed signal, for example when the information is watermarked or otherwise inaudibly inserted into the mixed signal, or through a particular channel of the mixed signal dedicated to transmitting said representative values.
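The text does not specify a watermarking scheme. As a deliberately simple stand-in, the sketch below hides side-information bits in the least significant bits of 16-bit PCM samples and reads them back; quantizing a representative value to 8 bits is a hypothetical choice. LSB replacement does not survive lossy coding or resampling, so this only illustrates the embed/extract round trip.

```python
def embed_lsb(samples, bits):
    """Overwrite the least significant bit of the leading samples with payload bits."""
    if len(bits) > len(samples):
        raise ValueError("not enough samples to carry the payload")
    out = list(samples)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | (b & 1)
    return out

def extract_lsb(samples, n_bits):
    """Read back the first n_bits payload bits."""
    return [s & 1 for s in samples[:n_bits]]

def to_bits(value, width):
    """Unsigned integer to a most-significant-bit-first list of bits."""
    return [(value >> k) & 1 for k in reversed(range(width))]

def from_bits(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

# Usage: carry one hypothetical 8-bit quantized power value in 9 PCM samples.
pcm = [1000, -3, 7, 0, -32768, 32767, 12, 5, 9]
marked = embed_lsb(pcm, to_bits(173, 8))
```

Each marked sample differs from the original by at most one quantization step, which is why such insertion is practically inaudible at 16 bits.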

In another aspect, there is provided a device for separating, at least partially, one or more particular digital audio source signals contained in a multichannel mixed digital audio signal. The mixed signal is obtained by mixing several digital audio source signals and includes values representative of the particular source signal(s). The device comprises:

means for determining the modulus of the amplitude or the normalized power of the particular source signal(s) from the representative values of said particular source signal(s) contained in the mixed signal, and
a linearly constrained minimum variance spatial filter adapted to isolate, at least partially, each particular source signal from the mixed signal, said filter being based on the distribution of said particular source signal between at least two channels of the mixed signal, and the modulus of the amplitude or the normalized power of said particular source signal being used as a linear constraint.

Preferably, the mixed signal is a stereo signal.

Preferably, the mixed signal comprises values representative of the particular source signal(s) for at least two channels of the mixed signal. In that case, the device comprises means for determining the distribution of each particular source signal between said at least two channels, using the mixed signal and said representative values.

Preferably, the means for determining the modulus of the amplitude or the normalized power comprises means for extracting the representative values of the particular source signal(s) that have been inserted into the mixed signal, for example by watermarking.

The invention will be better understood from the study of a particular embodiment, taken by way of non-limiting example and illustrated by the appended drawings, in which:

Figure 1 schematically represents an embodiment of a separation device according to the invention; and
Figure 2 is a flowchart of a separation method according to the invention.

In the remainder of the detailed description, the mixed signal s_mix(t) is considered to be a stereo signal with a left channel s_mix^g(t) and a right channel s_mix^d(t) (superscripts g and d, from the French gauche and droite). It contains p source signals s₁(t), ..., s_p(t). The mixed signal s_mix(t) can be written as the product of the p source signals by a mixing matrix A:

$$A = [a_1, \dots, a_p] = \begin{bmatrix} a_1^g & \dots & a_p^g \\ a_1^d & \dots & a_p^d \end{bmatrix}$$

where a_i = [a_i^g, a_i^d]^T (^T denoting the transpose). The coefficients a_i^g and a_i^d represent the distribution of source signal i in each channel of the mixed signal, with (a_i^g)² + (a_i^d)² = 1.

More precisely, the coefficients can be written as a_i^g = sin(θ_i) and a_i^d = cos(θ_i), where θ_i represents the balance of source signal i between the two channels of the mixed signal.

In other words:

$$s_{mix}(t) = A \, s(t)$$

with s_mix(t) = [s_mix^g(t), s_mix^d(t)]^T and s(t) = [s₁(t), ..., s_p(t)]^T (^T denoting the transpose).
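The mixing model can be illustrated with a short NumPy sketch that builds A from balance angles θ_i and applies it to p source signals. The function names `mixing_matrix` and `mix` and the example angles are hypothetical; only the sin/cos parameterization and the product s_mix(t) = A s(t) come from the text.

```python
import numpy as np

def mixing_matrix(theta):
    """Build the 2 x p stereo mixing matrix A from balance angles theta_i.

    Row 0 holds the left gains a_i^g = sin(theta_i) and row 1 the right gains
    a_i^d = cos(theta_i), so every column a_i has unit norm.
    """
    theta = np.asarray(theta, dtype=float)
    return np.vstack([np.sin(theta), np.cos(theta)])

def mix(A, s):
    """Instantaneous mix s_mix(t) = A . s(t); s has shape (p, T), result (2, T)."""
    return A @ s

# theta = 0 sends a source fully right, pi/2 fully left, pi/4 to the centre.
theta = np.array([0.0, np.pi / 4, np.pi / 2])
A = mixing_matrix(theta)
s = np.random.default_rng(0).standard_normal((3, 8))
s_mix = mix(A, s)
```

With this parameterization the unit-norm condition (a_i^g)² + (a_i^d)² = 1 holds automatically, so only one angle per source needs to be known.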

In the remainder of the description, the signals are also taken to be audio signals.

In this description, the short-term Fourier transform is used to transform signals into the time-frequency plane. The transform of source signal i in the time-frequency plane is thus written:

$$S_i(k, m) = \sum_{n} s_i(k + n) \, f(n) \, e^{-2\mathrm{i}\pi m n / N}$$

where N is a constant (the transform length) and f(n) is the window function of the short-term Fourier transform.
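A minimal NumPy sketch of this transform follows. The hop size between successive frame positions k, the periodic Hann window, the range n = 0, ..., N-1 and the zero padding are assumptions, since the text fixes only the per-frame formula.

```python
import numpy as np

def stft(s, N=1024, hop=512, window=None):
    """Short-term Fourier transform S(k, m) = sum_n s(k + n) f(n) e^{-2i pi m n / N}.

    Assumptions (not fixed by the text): frame positions k are multiples of
    `hop`, n runs over 0..N-1, f is a periodic Hann window unless `window` is
    given, and the signal is zero-padded so the last frame is complete.
    Returns an array of shape (number of frames, N) indexed [frame, bin m].
    """
    s = np.asarray(s, dtype=float)
    f = window if window is not None else np.hanning(N + 1)[:N]
    n_frames = 1 + max(0, int(np.ceil((len(s) - N) / hop)))
    padded = np.zeros((n_frames - 1) * hop + N)
    padded[: len(s)] = s
    frames = np.stack([padded[k * hop : k * hop + N] * f for k in range(n_frames)])
    # np.fft.fft computes sum_n x[n] e^{-2i pi m n / N} along the last axis.
    return np.fft.fft(frames, n=N, axis=-1)
```

For a real input, bins m and N - m are complex conjugates, so only the first N/2 + 1 bins carry independent information.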

In the remainder of the description, the linear constraint of the spatial filter is taken to be the normalized power. For a given source signal s_i and a given point (k, m) of the time-frequency plane, the normalized energy or power φ_i(k, m) is therefore:

$$\phi_i(k, m) = |S_i(k, m)|^2$$

The representative value of the source signal can thus be |S_i(k, m)| (modulus value) or φ_i(k, m) (energy value, equal to the normalized power value). It can also be the logarithm of the energy value:

$$\Phi_i(k, m) = 10 \log_{10} \phi_i(k, m)$$
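Both representations can be computed directly from the time-frequency transform, as in the sketch below. The `floor` parameter, which avoids taking the logarithm of zero in silent bins, is an assumption and is not in the text.

```python
import numpy as np

def normalized_power(S):
    """phi_i(k, m) = |S_i(k, m)|^2 at every time-frequency point."""
    return np.abs(S) ** 2

def power_db(phi, floor=1e-12):
    """Phi_i(k, m) = 10 log10(phi_i(k, m)); `floor` (assumed) avoids log10(0)."""
    return 10.0 * np.log10(np.maximum(phi, floor))
```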

The representative value of the source signal can also be determined after processing the source signal, for example by reducing the frequency resolution of the energy spectrum or by adapting the quantization of the representative values to the sensitivity of the human ear. This yields representative values that take up less space while preserving the desired sound quality.

In the remainder of the description, the representative value of the source signals is the quantized normalized power (or energy) value Φ_i(k, m).
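The two size-reduction ideas above can be illustrated with a hypothetical encoder that averages the energy over groups of frequency bins and quantizes the result uniformly in decibels. The uniform bands stand in for an ear-adapted band layout, and the function name and the parameters `band_width` and `step_db` are illustrative assumptions. The text does not specify the actual quantizer.

```python
import numpy as np

def encode_representative_values(phi, band_width=4, step_db=1.5, floor=1e-12):
    """Hypothetical encoder producing quantized values Phi_i(k, m).

    phi:        array (frames, bins) of normalized powers |S_i(k, m)|^2.
    band_width: number of bins merged into one band (uniform bands stand in
                for a perceptual, ear-adapted layout).
    step_db:    uniform quantization step in dB.
    Returns integer indices of shape (frames, number of bands).
    """
    phi = np.asarray(phi, dtype=float)
    bins = phi.shape[1]
    starts = np.arange(0, bins, band_width)
    widths = np.diff(np.append(starts, bins))      # last band may be narrower
    band_power = np.add.reduceat(phi, starts, axis=1) / widths
    phi_db = 10.0 * np.log10(np.maximum(band_power, floor))
    return np.round(phi_db / step_db).astype(int)
```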

The representative values Φ_i(k, m) of the source signals are transmitted to the separation device, or decoder. They can be sent over a dedicated channel (associated with the stereo channels to form the mixed signal) or embedded in the mixed signal, for example by watermarking or by using unused bits of the mixed signal. In the latter case, the separation device may comprise means for extracting the representative values, which takes the mixed signal as input and outputs the representative values of the source signals.

Similarly, the separation device can also receive the distributions of the source signals in each channel of the mixed signal: a₁^g, ..., a_p^g, a₁^d, ..., a_p^d. These distributions can be transmitted over a dedicated channel (associated with the stereo channels to form the mixed signal, or independent of them) or embedded in the mixed signal, for example by watermarking or by using unused bits of the mixed signal. In the latter case, the separation device may comprise means for extracting the distributions of the source signals, which takes the mixed signal as input and outputs the distributions. The means for extracting the representative values and the means for extracting the distributions may be one and the same.

Alternatively, the separation device may comprise means for determining the distributions of the source signals. Such means may take the mixed signal and the representative values Φ_i(k, m) as input and output the distribution a_i^g, a_i^d of each source signal. This is possible in particular when each channel of the mixed signal carries the representative values of a source signal for that channel. In other words, the representative values of a given source signal differ from one channel of the mixed signal to another, and the difference between these values across channels makes it possible to determine the distribution of that source signal between the channels.
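One way such a determination could work is sketched below. It assumes, hypothetically, that the per-channel representative values are in decibels and describe the power of source i in each channel, that is (a_i^g)² φ_i and (a_i^d)² φ_i. Summing over time-frequency points then cancels φ_i and leaves tan(θ_i) = a_i^g / a_i^d. This simplified sketch uses only the representative values, not the mixed signal itself, and the function name is illustrative.

```python
import numpy as np

def estimate_balance(Phi_left_db, Phi_right_db):
    """Hypothetical estimate of theta_i from per-channel representative values.

    Assumes Phi_left_db and Phi_right_db hold 10 log10 of source i's power in
    the left and right channel, i.e. (a_i^g)^2 phi_i and (a_i^d)^2 phi_i.
    Summing over all time-frequency points cancels phi_i, so
    tan(theta_i) = a_i^g / a_i^d = sqrt(E_left / E_right).
    Returns (theta_i, a_i^g, a_i^d).
    """
    e_left = np.sum(10.0 ** (np.asarray(Phi_left_db) / 10.0))
    e_right = np.sum(10.0 ** (np.asarray(Phi_right_db) / 10.0))
    theta = np.arctan2(np.sqrt(e_left), np.sqrt(e_right))
    return theta, np.sin(theta), np.cos(theta)
```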

Figure 1 schematically shows an embodiment of a device 1 for separating particular source signals contained in a mixed signal s_mix. The separation device 1 receives as input the stereo channels s_mix^g and s_mix^d of the mixed signal s_mix, and delivers the at least partially separated particular source signals s'_i, with i ranging from 1 to p. The purpose of the separation device 1 is to deliver, at least partially, several particular source signals contained in the mixed signal s_mix by using the representative values Φ_i(k, m) of those particular source signals.

For the purposes of this description, the separation device 1 is considered to receive as input the channels s_mix^g(t) and s_mix^d(t) of the mixed digital audio signal. Inserted into these channels, for example by watermarking, are the representative values Φ_i(k, m) of the particular source signals and, optionally, the distributions a₁^g, ..., a_p^g, a₁^d, ..., a_p^d of the particular source signals between the two channels.

The separation device 1 comprises a transformation means 2, an extraction means 3, a processing means 4, a filtering means 5, and an inverse transformation means 6.

The transformation means 2 receives as input the channels s_mix^g(t) and s_mix^d(t) of the mixed digital audio signal and outputs their transforms in the time-frequency plane, S_mix^g(k, m) and S_mix^d(k, m).

The extraction means 3 receives as input the time-frequency transforms S_mix^g(k, m) and S_mix^d(k, m) of the mixed-signal channels and delivers the representative values Φ_i(k, m) of the particular source signals contained in the mixed signal. Where applicable, the extraction means 3 can also deliver the distributions a₁^g, ..., a_p^g, a₁^d, ..., a_p^d of the particular source signals between the two channels, if these have been inserted into the mixed signal. The extraction means 3 thus extracts the representative values that were added to the mixed signal afterwards, for example by watermarking, and isolates them from it. The representative values Φ_i(k, m) are then passed to the processing means 4 and, where applicable, the distributions are passed to the filtering means 5.

Alternatively, the extraction means 3 may take the channels s_mix^g(t) and s_mix^d(t) of the mixed signal directly as input.

The processing means 4 processes the representative values Φ_i(k, m) received from the extraction means 3 to determine an estimate φ'_i(k, m) of the normalized power of each source signal to be separated, in the time-frequency plane. These estimates are then passed to the filtering means 5.
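The text does not specify how the processing means 4 turns Φ_i(k, m) into φ'_i(k, m). If the values were quantized uniformly in decibels and possibly grouped into bands, as in the hypothetical scheme sketched after the definition of Φ_i, the inverse could look like the following. The function name and the parameters `step_db`, `band_width` and `n_bins` are illustrative assumptions.

```python
import numpy as np

def decode_power(q, step_db=1.5, band_width=4, n_bins=None):
    """Hypothetical decoder: quantization indices -> normalized power phi'_i(k, m).

    Inverts a uniform dB quantization (phi' = 10^(q * step_db / 10)) and, if the
    values were sent per band, repeats each band value over its bins.
    `n_bins` trims the result back to the original number of bins.
    """
    phi_band = 10.0 ** (np.asarray(q) * step_db / 10.0)
    phi = np.repeat(phi_band, band_width, axis=-1)
    return phi if n_bins is None else phi[..., :n_bins]
```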

The filtering means 5 is thus supplied with three inputs: the time-frequency transforms S_mix^g(k, m) and S_mix^d(k, m) from the transformation means 2, the normalized power estimates φ'_i(k, m) of the particular source signals, and the distributions a₁^g, ..., a_p^g, a₁^d, ..., a_p^d of the particular source signals between the two channels of the mixed signal.

The filtering means 5 obtains an estimate S'_i(k, m) of each particular source signal by spatial filtering. In the time-frequency plane, it isolates the particular source signal using linearly constrained minimum variance spatial filtering. More specifically, the filtering means 5 relies on the distribution of the particular source signal between the two channels of the mixed signal to isolate it, which makes this spatial filtering, or beamforming. In addition, to improve the filtering and the resulting estimate, the spatial filter uses the normalized power of the particular source signal to be separated as its linear constraint, which brings the estimate closer to the original source signal.

More precisely, in the time-frequency plane:

$$S_{mix}(k, m) = A \, S(k, m)$$

with:

$$S_{mix}(k, m) = [S_{mix}^g(k, m), S_{mix}^d(k, m)]^T$$

and

$$S(k, m) = [S_1(k, m), \dots, S_p(k, m)]^T$$

The mixed signal S_mix(k, m) is then decomposed into estimates S'₁(k, m), ..., S'_p(k, m) of the particular source signals using linear spatial filtering:

$$S'_i(k, m) = w_{ik}^g \, S_{mix}^g(k, m) + w_{ik}^d \, S_{mix}^d(k, m) = W_{ik}^T \, S_{mix}(k, m)$$

with:

$$W_{ik} = [w_{ik}^g, w_{ik}^d]^T$$

W_ik is the spatial filter (or beamformer) that yields the estimate S'_i(k, m) of the i-th source signal in subband k from the mixed signal S_mix(k, m).

For a linearly constrained minimum variance spatial filter, the sum of all interfering source signals other than the one to be extracted is treated as noise. The mixed signal can thus be rewritten as:

$$S_{mix}(k, m) = a_i \, S_i(k, m) + r(k, m)$$

where r(k, m) is the sum of the other source signals.

The estimate S'_i(k, m) is obtained by minimizing the average noise power or, equivalently, the average output power of the spatial filter, in the direction of the source signal to be separated:

$$P(\theta_i) = W_{ik}^T(m) \, R'_{S_{mix}}(k, m) \, W_{ik}(m)$$

where R'_Smix(k, m) is the estimated spatial correlation matrix of the two channels S_mix^g(k, m) and S_mix^d(k, m) of the mixed signal S_mix(k, m).

The solution is given by:

$$W_{ik}(m) = R'^{-1}_{S_{mix}}(k, m) \, a_i \sqrt{\frac{\phi'_i(k, m)}{a_i^T \, R'^{-1}_{S_{mix}}(k, m) \, a_i}}$$

This gives:

$$S'_i(k, m) = \sqrt{\frac{\phi'_i(k, m)}{a_i^T \, R'^{-1}_{S_{mix}}(k, m) \, a_i}} \; a_i^T \, R'^{-1}_{S_{mix}}(k, m) \, S_{mix}(k, m)$$

with:

$$R'_{S_{mix}}(k, m) = \sum_{j=1}^{p} \phi'_j(k, m) \, a_j \, a_j^T$$

which is the correlation matrix of the mixture model above when the source signals are mutually uncorrelated.
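The filter can be applied to all time-frequency points at once with NumPy, as in the sketch below. The array layout, the function name and the small diagonal loading `eps`, which keeps R' invertible when fewer than two sources are active at a point, are assumptions and do not come from the text. The mixing gains follow the a_i = [sin θ_i, cos θ_i]^T parameterization.

```python
import numpy as np

def lcmv_separate(S_mix, phi, theta, eps=1e-12):
    """Linearly constrained minimum variance separation at every (k, m).

    S_mix: complex array (2, K, M) holding the left/right transforms S_mix^g, S_mix^d.
    phi:   array (p, K, M) of normalized power estimates phi'_i(k, m).
    theta: array (p,) of balance angles; a_i = [sin(theta_i), cos(theta_i)]^T.
    eps:   diagonal loading (an assumption) that keeps R' invertible.
    Returns the complex estimates S'_i(k, m), shape (p, K, M).
    """
    a = np.vstack([np.sin(theta), np.cos(theta)])              # (2, p)
    # R'(k, m) = sum_j phi'_j(k, m) a_j a_j^T, shape (K, M, 2, 2)
    R = np.einsum("ip,jp,pkm->kmij", a, a, phi) + eps * np.eye(2)
    Ra = np.linalg.inv(R) @ a                                   # R'^{-1} a_i, (K, M, 2, p)
    denom = np.einsum("ip,kmip->pkm", a, Ra)                    # a_i^T R'^{-1} a_i
    gain = np.sqrt(phi / denom)                                 # scale set by the linear constraint
    proj = np.einsum("kmip,ikm->pkm", Ra, S_mix)                # a_i^T R'^{-1} S_mix (R' symmetric)
    return gain * proj
```

With exactly two sources at distinct balances and exact powers φ'_i = |S_i|², one can show that a_i^T R'^{-1} a_j vanishes for i ≠ j and equals 1/φ'_i for i = j. The filter then inverts the mix exactly. With more than two sources it can only attenuate the interferers, which matches the "at least partially" wording of the text.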

When applied to the mixed signal S_mix(k, m), the resulting filter reduces the contribution of the other signals' power spectra. Moreover, thanks to the linear constraint, the power of the estimated source signal matches the (estimated) power of the original source signal at each point of the time-frequency plane. This can be checked by substituting the solution W_ik into the expression for P(θ_i), which gives P(θ_i) = φ'_i(k, m). The filtering means 5 thus spatially decorrelates the i-th source signal from the rest of the mixed signal while adjusting the amplitude of the decorrelated signal to the desired level.

When the amount of information watermarked into the mixed signal is too large for the watermark noise to be neglected, the components of the estimated source signals can also be adjusted as follows:

$$S'_i(k, m) \leftarrow S'_i(k, m) \, \frac{\sqrt{\phi'_i(k, m)}}{|S'_i(k, m)|}$$
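This adjustment keeps the phase of each estimate and replaces its magnitude with √φ'_i(k, m). A one-line NumPy sketch follows. The `eps` guard against division by zero is an assumption.

```python
import numpy as np

def match_magnitude(S_est, phi, eps=1e-12):
    """Rescale S'_i(k, m) to magnitude sqrt(phi'_i(k, m)) while keeping its phase.

    Implements S'_i <- S'_i * sqrt(phi'_i) / |S'_i|; `eps` (assumed) guards
    against division by zero where the estimate vanishes.
    """
    return S_est * np.sqrt(phi) / np.maximum(np.abs(S_est), eps)
```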

The transforms of the separated particular source-signal estimates are then passed to the inverse transformation means 6, which converts them into time signals s'₁(t), ..., s'_p(t) corresponding, at least partially, to the source signals s₁(t), ..., s_p(t).
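The text does not specify how the transform is inverted. A common choice for a windowed short-term Fourier transform is weighted overlap-add, sketched below under the assumptions that frames start every `hop` samples and were analysed with a known window f (periodic Hann by default). Samples where the summed squared window is zero, such as the very first sample with a Hann window, cannot be recovered and are returned as zero.

```python
import numpy as np

def istft(S, N=1024, hop=512, window=None, length=None):
    """Inverse of a framed STFT by weighted overlap-add (least-squares synthesis).

    S has shape (frames, N), frames start every `hop` samples and were analysed
    with window f. Each frame is inverse-FFT'd, multiplied by f again,
    overlap-added, and divided by the accumulated f^2.
    """
    f = window if window is not None else np.hanning(N + 1)[:N]
    n_frames = S.shape[0]
    out = np.zeros((n_frames - 1) * hop + N)
    norm = np.zeros_like(out)
    frames = np.real(np.fft.ifft(S, n=N, axis=-1))
    for k in range(n_frames):
        out[k * hop : k * hop + N] += frames[k] * f
        norm[k * hop : k * hop + N] += f ** 2
    out /= np.maximum(norm, 1e-12)        # zero-weight samples stay at 0
    return out if length is None else out[:length]
```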

Figure 2 is a flowchart showing the steps of the separation method according to the invention.

The method comprises a first step 7 in which the mixed signal is transformed into a time-frequency plane. Then, in step 8, the information watermarked into the mixed signal is extracted, in particular the representative values and the distributions of the source signals between at least two channels of the mixed signal. In step 9, the normalized powers of the source signals to be separated are determined. Then, in step 10, linearly constrained minimum variance spatial filtering is performed, the constraint being the normalized power of the source signal to be separated. Finally, in step 11, the transforms of the separated particular source signals are inverse-transformed so as to obtain, at least partially, the particular source signals.
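The five steps can be chained in a compact end-to-end sketch. To keep it short, it makes several hypothetical simplifications. Step 7 uses non-overlapping rectangular frames instead of a windowed short-term Fourier transform. Step 8 is stubbed, with the side information (normalized powers and balance angles) passed in directly rather than decoded from a watermark. Step 9 takes those powers as φ'_i unchanged. The function name and array layouts are illustrative.

```python
import numpy as np

def separate(s_left, s_right, phi_side, theta, N=256):
    """End-to-end sketch of steps 7 to 11 (hypothetical simplifications).

    Step 7  transform: non-overlapping rectangular frames + FFT.
    Step 8  extraction: stubbed; phi_side (p, frames, N) and theta (p,) are
            passed in directly instead of being read from a watermark.
    Step 9  normalized powers phi'_i: taken as phi_side.
    Step 10 LCMV spatial filtering with phi'_i as the linear constraint.
    Step 11 inverse transform back to p time signals.
    """
    K = len(s_left) // N  # whole frames only; trailing samples are dropped
    X = np.stack([np.asarray(s_left)[: K * N], np.asarray(s_right)[: K * N]])
    S_mix = np.fft.fft(X.reshape(2, K, N), axis=-1)                 # step 7: (2, K, N)

    phi = np.asarray(phi_side, dtype=float)                          # steps 8-9
    a = np.vstack([np.sin(theta), np.cos(theta)])                    # (2, p)
    R = np.einsum("ip,jp,pkm->kmij", a, a, phi) + 1e-12 * np.eye(2)
    Ra = np.linalg.inv(R) @ a                                        # (K, N, 2, p)
    denom = np.einsum("ip,kmip->pkm", a, Ra)
    S_est = np.sqrt(phi / denom) * np.einsum("kmip,ikm->pkm", Ra, S_mix)  # step 10

    # Step 11: the per-bin filter is real and phi is conjugate-symmetric for
    # real sources, so the inverse FFT is real up to rounding.
    return np.real(np.fft.ifft(S_est, axis=-1)).reshape(len(theta), K * N)
```

Rectangular, non-overlapping frames make the inverse transform trivial. A practical system would more likely use overlapping windowed frames to avoid block-edge artifacts when the filter varies from frame to frame.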

For audio signals, the separation system of the invention thus makes it possible to apply a number of major listening controls (volume, tone, effects) independently to the different elements of the sound stage, that is, the instruments and voices obtained by the separation device.

Claims

1. A method of separating, at least in part, one or more particular digital audio source signals (s_i) contained in a mixed multichannel digital audio signal (s_mix), the mixed signal being obtained by mixing a plurality of digital audio source signals (s₁, ..., s_p) and including representative values (Φ_i) of the particular source signal(s), the method comprising:
· determining (9) the modulus of the amplitude or the normalized power (ϕ'_i) of the particular source signal(s) from the representative values (Φ_i) in the time-frequency plane of said particular source signal(s) contained in the mixed signal;
the method being characterized by then performing (10) linearly constrained minimum variance spatial filtering in order to obtain, at least in part, each particular source signal (s'_i), said filtering being based on the distribution (a_i) of said particular source signal between at least two channels of the mixed signal, and the modulus of the amplitude or the normalized power (ϕ'_i) of said particular source signal being used as a linear constraint of the filter.
2. A method according to claim 1, wherein the mixed signal includes representative values (Φ_i^g, Φ_i^d) of the particular source signal(s) for at least two channels of the mixed signal (s_mix^g, s_mix^d), and wherein, prior to performing spatial filtering, the mixed signal and said representative values of the particular signals are used to determine the distribution (a_i^g, a_i^d) of each particular source signal (s_i) between said at least two channels of the mixed signal (s_mix^g, s_mix^d).
3. A method according to claim 1, wherein the distribution (a_i) of the particular source signal(s) between at least two channels of said mixed signal is received as input, e.g. in the mixed signal.
4. A method according to any preceding claim, wherein determining the modulus of the amplitude or the normalized power (ϕ'_i) of the particular source signal(s) comprises extracting (8) representative values (Φ_i) of the particular source signals that have been inserted into the mixed signal, e.g. by watermarking.
5. A method according to any preceding claim, wherein the modulus of the amplitude or the normalized power (ϕ'_i) of said particular source signal are spectro-temporal values.
6. A device (1) for separating, at least in part, one or more particular digital audio source signals (s_i) contained in a multichannel mixed digital audio signal (s_mix), the mixed signal (s_mix) being obtained by mixing a plurality of digital audio source signals (s₁, ..., s_p) and including representative values (Φ_i) of the particular source signal(s), the device comprising:
· determination means (4) for determining the modulus of the amplitude or the normalized power (ϕ'_i) of the particular source signal(s) from the representative values (Φ_i) in the time-frequency plane of said particular source signal(s) contained in the mixed signal;
the device being characterized by a linearly constrained minimum variance spatial filter (5) adapted to isolate, at least in part, each particular source signal (s'_i) from the mixed signal (s_mix), said filter being based on the distribution (a_i) of said particular source signal between at least two channels of the mixed signal (s_mix^g, s_mix^d), and the modulus of the amplitude or the normalized power (ϕ'_i) of said particular source signal being used as a linear constraint.
7. A device according to claim 6, wherein the mixed signal includes representative values (Φ_i) of the particular source signal(s) for at least two channels of the mixed signal, the device including determination means for determining the distribution (a_i) of each particular source signal between said at least two channels of the mixed signal from the mixed signal and from said representative values (Φ_i) of the particular source signals.
8. A device according to claim 6 or claim 7, also including extractor means (3) for extracting the representative values (Φ_i) of the particular source signal(s) that have been inserted in the mixed signal, e.g. by watermarking.
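Claims 2 and 7 determine the distribution a_i of each source between the channels, but the claims do not say how. The sketch below shows one hypothetical possibility. It assumes the per-channel representative values Φ_i^g and Φ_i^d are power spectrograms of the source's contribution to each channel, so that Φ_i^c(f, t) = (a_i^c)² ϕ_i(f, t) for a time-invariant, non-negative panning gain a_i^c. Under that assumption the energy ratio between the two channels fixes a unit-norm panning vector. Unlike the claims, this sketch does not use the mixed signal itself, and the function name is hypothetical.

```python
import numpy as np


def panning_from_channel_powers(phi_left, phi_right):
    """Hypothetical estimate of a unit-norm panning vector (a_left, a_right).

    Assumes phi_c(f, t) = a_c**2 * phi(f, t), with non-negative, time-invariant
    gains a_c and a common source power spectrogram phi(f, t).
    """
    e_left = np.sum(phi_left)    # source energy in the left channel
    e_right = np.sum(phi_right)  # source energy in the right channel
    total = e_left + e_right
    if total <= 0:
        raise ValueError("source has no energy in either channel")
    # sqrt(e_c / total) = a_c / ||a||, i.e. the gains rescaled to unit norm
    return np.sqrt(np.array([e_left, e_right]) / total)
```

Only gain magnitudes are identified this way: a polarity inversion in one channel leaves the power values unchanged and would not be detected.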