EP1830349B1

EP1830349B1 - Method of noise reduction of an audio signal

Info

Publication number: EP1830349B1
Application number: EP07290219A
Authority: EP
Inventors: Guillaume Pinto
Original assignee: Parrot SA
Current assignee: Parrot SA
Priority date: 2006-03-01
Filing date: 2007-02-21
Publication date: 2011-11-30
Anticipated expiration: 2027-02-21
Also published as: EP1830349A1; ATE535905T1; FR2898209A1; WO2007099222A1; FR2898209B1; US20070276660A1; US7953596B2; ES2378482T3

Abstract

The method involves determining a reference signal by applying to a noisy audio signal a process that attenuates the voice components of that signal, using a predictive least mean squares (LMS) algorithm. A probability of presence/absence of voice is determined from the respective energy levels, in the spectral domain, of the audio signal and the reference signal. This probability is then used to estimate a noise spectrum and to derive a noise-reduced estimate of the voice signal from the audio signal.

Description

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to the denoising of audio signals picked up by a microphone in a noisy environment.

The invention applies advantageously, but without limitation, to speech signals picked up by hands-free telephone devices or the like.

These devices include a sensitive microphone that picks up not only the user's voice but also the surrounding noise, a disturbance that can in some cases make the speaker's words unintelligible.

The same applies when speech recognition techniques are to be used, since it is very difficult to perform pattern recognition on words buried in a high level of noise.

This difficulty linked to ambient noise is particularly constraining in the case of hands-free devices for motor vehicles. In particular, the large distance between the microphone and the speaker leads to a high relative noise level, which makes it difficult to extract the useful signal buried in the noise. In addition, the very noisy environment typical of a car has non-stationary spectral characteristics, that is to say characteristics that change unpredictably with driving conditions: driving over uneven or cobbled roads, car radio in operation, etc.

Description of the Related Art

Various techniques have been proposed to reduce the noise level of the signal picked up by a microphone.

For example, WO-A-98/45997 (Parrot SA) uses the press of a telephone's activation push-button (for example when the driver wants to answer an incoming call) to detect the start of a speech signal, and considers that the signal picked up before this press was essentially noise. This earlier signal, which is stored, is analyzed to give a weighted average energy spectrum of the noise, which is then subtracted from the noisy speech signal.

US-A-5 742 694 describes another technique, which uses a mechanism of the predictive adaptive filter type. This filter delivers a "reference signal" corresponding to the predictable part of the noisy signal and an "error signal" corresponding to the prediction error, then attenuates these two signals in variable proportions and recombines them to provide a denoised signal.

The major drawback of this denoising technique lies in the significant distortion introduced by the prefiltering, which yields an output signal that is badly degraded in acoustic quality. It is also poorly suited to situations that call for vigorous denoising of a speech signal buried in noise of a complex and unpredictable nature, with non-stationary spectral characteristics.

Still other techniques, known as beamforming or double-phoning, use two separate microphones. The first is designed and placed to pick up mainly the speaker's voice, while the other is designed and placed to pick up a larger noise component than the main microphone. Comparing the two captured signals makes it possible to extract the voice from the ambient noise effectively, using relatively simple software means.

This technique, based on an analysis of the spatial coherence of two signals, nevertheless has the drawback of requiring two microphones at a distance from each other, which generally confines it to fixed or semi-fixed installations and prevents it from being integrated into a pre-existing device simply by adding a software module. It also presupposes that the position of the speaker relative to the two microphones is approximately constant, which is generally the case for a car telephone used by the driver. Furthermore, to obtain reasonably satisfactory denoising, the signals are subjected to substantial prefiltering, which here again has the drawback of introducing distortions that degrade the quality of the restored denoised signal.

The invention relates to a technique for denoising audio signals picked up by a single microphone recording a voice signal in a noisy environment.

Many of the most effective methods used in single-microphone systems are based on the statistical model established by D. Malah and Y. Ephraim in:

[1] Y. Ephraim and D. Malah, Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, pp. 1109-1121, Dec. 1984, and
[2] Y. Ephraim and D. Malah, Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-33, No. 2, pp. 443-445, April 1985.

On the approximation that speech and noise are uncorrelated Gaussian processes, and on the assumption that the spectral power of the noise is known, these two articles give an optimal solution to the noise reduction problem described above. This solution consists in splitting the noisy signal into independent frequency components by means of the discrete Fourier transform, applying an optimal gain to each of these components, and then recombining the signal thus processed.

The two articles differ in their choice of optimality criterion. In [1], the applied gain is called the STSA gain and minimizes the mean square distance between the estimated signal (at the output of the algorithm) and the original (noise-free) speech signal. In [2], applying a gain called the LSA gain minimizes the mean square distance between the logarithm of the amplitude of the estimated signal and the logarithm of the amplitude of the original speech signal. This second criterion proves superior to the first because the chosen distance matches the behavior of the human ear much more closely and therefore gives qualitatively better results. In both cases, the essential idea is to reduce the energy of the very noisy frequency components by applying a low gain to them, while leaving intact (by applying a gain equal to 1) those that are only slightly noisy or not noisy at all.
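The two optimality criteria can be written compactly as follows. The notation is an assumption introduced for illustration and does not appear in the text above: A(k,l) denotes the spectral amplitude of the noise-free speech in frequency component k of frame l, \hat{A}(k,l) its estimate, and X(k,l) the noisy observation. The criteria are stated on spectral amplitudes, as the titles of [1] and [2] indicate.

```latex
% STSA criterion of [1]: minimum mean-square error on the spectral amplitude
\hat{A}_{\mathrm{STSA}}(k,l) = \arg\min_{\hat{A}}
  E\left[ \bigl( A(k,l) - \hat{A} \bigr)^{2} \,\middle|\, X(k,l) \right]

% LSA criterion of [2]: minimum mean-square error on the log-amplitude
\hat{A}_{\mathrm{LSA}}(k,l) = \arg\min_{\hat{A}}
  E\left[ \bigl( \log A(k,l) - \log \hat{A} \bigr)^{2} \,\middle|\, X(k,l) \right]
```

In both cases the estimate is obtained by multiplying the noisy amplitude |X(k,l)| by a real gain, close to 1 for components with little noise and small for very noisy components.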

Although attractive because it is backed by a rigorous mathematical proof, this method cannot be implemented on its own. As indicated above, the spectral power of the noise is unknown and unpredictable ex ante. Moreover, the method does not attempt to evaluate at which moments the speaker's speech is present in the captured signal. It simply assumes either that speech is always present or that it is present for a fixed fraction of the time, which can seriously limit the quality of the noise reduction.

It is therefore necessary to use another algorithm whose function is to evaluate the spectral power of the noise and the instants at which the speaker's speech is present in the raw captured signal. It even turns out that this estimation is the determining factor in the quality of the noise reduction achieved, the Ephraim and Malah algorithm being merely the optimal way of using the information thus obtained.

The present invention provides an original solution to this twofold problem of evaluating the noise and the instants at which the speech signal is present.

These two questions are in fact intrinsically linked. Suppose that the raw captured signal is divided into frames of equal length, and that the short-time Fourier transform of each frame is computed.

For a given frequency component, knowing the indices of the frames in which speech is absent makes it possible to evaluate the noise power, and its evolution over time, in that segment of the spectrum. It suffices to measure the energy of the raw signal when speech is absent and to keep a continuously updated average of these measurements. The main question is therefore to know exactly when the speaker's speech is absent from the signal picked up by the microphone.
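The continuously updated average described above can be sketched as a first-order recursive average per frequency component. This is a minimal illustration only: the function name, the smoothing factor `alpha` and the list-based representation of a frame are assumptions, and the speech-absence flags are taken as given.

```python
def update_noise_estimate(noise_power, frame_power, speech_absent, alpha=0.9):
    """Return the per-bin noise power after one frame.

    noise_power   : current noise power estimate, one value per bin
    frame_power   : energy of the raw signal in this frame, one value per bin
    speech_absent : True for the bins where speech is known to be absent
    alpha         : smoothing factor (hypothetical value, not from the text)
    """
    updated = []
    for previous, current, absent in zip(noise_power, frame_power, speech_absent):
        if absent:
            # Only noise is present: fold the new measurement into the average.
            updated.append(alpha * previous + (1.0 - alpha) * current)
        else:
            # Speech is present: the measurement says nothing reliable about noise.
            updated.append(previous)
    return updated


noise = update_noise_estimate([1.0, 1.0], [2.0, 9.0], [True, False], alpha=0.5)
```

In this usage example the first bin moves halfway toward the new measurement, while the second bin, flagged as containing speech, keeps its previous estimate.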

If the noise is stationary or pseudo-stationary, this problem can easily be solved by declaring that speech is absent in a spectrum segment of a given frame when the spectral energy of the data for that spectrum segment has changed little or not at all relative to the most recent frames. Conversely, speech is declared present when the behavior is non-stationary.
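A minimal sketch of this stationarity test for one spectrum segment is given below. The comparison against the mean of the recent frames and the `tolerance` value are assumptions chosen for illustration; the text does not specify how "changed little" is measured.

```python
def speech_absent_by_stationarity(current_power, recent_powers, tolerance=0.5):
    """Return True when the energy of one spectrum segment is close to its
    recent average, which is read here as 'speech absent'.

    tolerance is a hypothetical relative threshold, not a value from the text.
    """
    if not recent_powers:
        return True  # assumption: with no history yet, treat the segment as noise
    baseline = sum(recent_powers) / len(recent_powers)
    if baseline <= 0.0:
        return current_power <= 0.0
    relative_change = abs(current_power - baseline) / baseline
    return relative_change <= tolerance


steady = speech_absent_by_stationarity(1.1, [1.0, 1.0, 1.0])
burst = speech_absent_by_stationarity(4.0, [1.0, 1.0, 1.0])
```

Here `steady` is True (a 10 % change is within the tolerance) and `burst` is False. The test reacts in the same way to a burst of speech and to a burst of noise.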

However, in a real environment, and all the more so in a car environment, whose noise was noted above to have many non-stationary spectral characteristics, this method is easily defeated, since both speech and noise can exhibit transient behavior. If it is decided to keep all the transient components, residual musical noise will remain in the denoised data; conversely, if it is decided to remove the transient components below a given energy threshold, the weak components of speech will be erased, even though these components can be important both for their information content and for the overall intelligibility (low distortion) of the denoised signal restored after processing.

In this regard, various methods have been proposed. Among the most effective is the one described in:

[3] I. Cohen and B. Berdugo, Speech Enhancement for Non-Stationary Noise Environments, Signal Processing, Elsevier, Vol. 81, pp. 2403-2418, 2001.

As is frequently the case in this field, the method described in that article does not aim to identify precisely on which frequency components of which frames speech is absent, but rather to give a confidence index between 0 and 1, a value of 1 indicating that speech is certainly absent (according to the algorithm) while a value of 0 indicates the opposite. By its nature, this index is treated as the a priori probability of speech absence, that is to say the probability that speech is absent from a given frequency component of the frame under consideration.

This identification is of course not rigorous, in the sense that even if the presence of speech is probabilistic ex ante, the signal picked up by the microphone can at any instant be in only one of two distinct states: either it contains speech at that instant or it does not. Nevertheless, this identification gives good results in practice, which justifies its use. To estimate this probability of absence, Cohen and Berdugo use averages of the a priori signal-to-noise ratios that are themselves used and calculated in the Ephraim and Malah algorithm. These authors also describe the so-called OM-LSA (Optimally-Modified Log-Spectral Amplitude) gain technique, which aims to improve the LSA gain by incorporating this probability of speech absence.
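One common way of expressing the OM-LSA combination is a geometric weighting between the LSA gain and a small floor gain, with the speech presence probability as the exponent. The sketch below assumes that form. The function name, the `min_gain` value and the use of a presence probability as direct input are assumptions for illustration; how that probability is derived from the absence index is not shown.

```python
def om_lsa_gain(lsa_gain, speech_presence_prob, min_gain=0.1):
    """Combine an LSA gain with a speech presence probability for one bin.

    lsa_gain             : gain that would be applied if speech were present
    speech_presence_prob : value in [0, 1]; 1 means speech is certainly present
    min_gain             : floor gain applied when speech is absent
                           (hypothetical value, not taken from the text)
    """
    p = min(max(speech_presence_prob, 0.0), 1.0)
    # Geometric weighting: p = 1 gives the LSA gain, p = 0 gives the floor.
    return (lsa_gain ** p) * (min_gain ** (1.0 - p))


full_speech = om_lsa_gain(0.8, 1.0)
no_speech = om_lsa_gain(0.8, 0.0)
```

With certain speech presence the LSA gain is returned unchanged; with certain absence the component is attenuated down to the floor, which limits residual noise without forcing the output to zero.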

This estimate of the a priori probability of speech absence proves effective, but it depends directly on the statistical model developed by Ephraim and Malah and not on a priori knowledge of the data.

To obtain an estimate of the probability of absence that is independent of this statistical model, Cohen and Berdugo proposed in:

[4] I. Cohen and B. Berdugo, Two Channel Signal Detection and Speech Enhancement Based on the Transient Beam-to-Reference Ratio, Proc. ICASSP 2003, Hong Kong, pp. 233-236, April 2003,

to calculate the probability of absence from signals picked up by two differently placed microphones, giving respective signals on two different channels, the combination of which yields a so-called output channel and a so-called reference noise channel. The analysis is based on the observation that the speech components are relatively weaker on the reference noise channel, and that the transient noise components have approximately the same energy on both channels. A speech presence probability for each spectrum segment of each frame is determined by calculating an energy ratio between the non-stationary components of the respective signals of the two channels.
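The energy ratio between non-stationary components can be illustrated for a single spectrum segment as follows. This is a sketch under stated assumptions, not the formulation of [4]: the non-stationary part of each channel is approximated here by the excess of its energy over a pseudo-stationary noise estimate, and all names are hypothetical. The mapping from the ratio to a probability is deliberately left out.

```python
def transient_ratio(out_power, out_stationary, ref_power, ref_stationary, eps=1e-12):
    """Ratio of transient energy on the output channel to transient energy
    on the reference noise channel, for one spectrum segment.

    *_power      : measured energy on each channel
    *_stationary : estimated pseudo-stationary noise energy on each channel
    eps          : small constant that avoids a division by zero
    """
    out_transient = max(out_power - out_stationary, 0.0)
    ref_transient = max(ref_power - ref_stationary, 0.0)
    return out_transient / max(ref_transient, eps)


speech_like = transient_ratio(10.0, 2.0, 3.0, 2.0)
noise_like = transient_ratio(5.0, 2.0, 5.0, 2.0)
```

A ratio well above 1 (`speech_like`) corresponds to a transient that is much weaker on the reference channel, as expected for speech; a ratio close to 1 (`noise_like`) corresponds to a transient of similar energy on both channels, as expected for transient noise.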

However, as with the beamforming or double-phoning techniques mentioned above, this method is rather restrictive in that it requires two microphones.

SUMMARY OF THE INVENTION

One of the aims of the invention is to remedy the drawbacks of the methods proposed so far, by means of an improved denoising method that can be applied to a speech signal considered in isolation, in particular a signal picked up by a single microphone, and that is based on analyzing the temporal coherence of the captured signals.

The starting point of the invention lies in the observation that speech generally has greater temporal coherence than noise and that, as a result, it is markedly more predictable. Essentially, the invention proposes to use this property to calculate a reference signal in which speech has been attenuated more than noise, in particular by applying a predictive algorithm, which may for example be of the LMS (Least Mean Squares) type. This reference signal, derived from the speech signal to be denoised, can be used in a manner comparable to the signal from the second microphone in two-channel beamforming techniques, for example techniques similar to those of Cohen and Berdugo [4, cited above]. Calculating a ratio between the respective energy levels of the original signal and of the reference signal thus obtained makes it possible to discriminate between speech components and non-stationary interfering noise, and provides an estimate of the probability of speech presence that is independent of any statistical model.

In other words, the technique proposed by the invention implements an "intelligent subtraction" involving, after a linear prediction performed on past samples of the original signal (and not of a prefiltered, and therefore degraded, signal), a phase realignment between the original signal and the predicted signal.

In practice, the technique of the invention proves effective enough to provide extremely efficient denoising directly on the original signal, free of the distortions introduced by a prefiltering chain, which is no longer needed.

More specifically, the present invention proposes, for the denoising of an original noisy audio signal comprising a speech component combined with a noise component, which itself comprises a transient noise component and a pseudo-stationary noise component, to perform a temporal coherence analysis of the noisy signal by the steps of:

a) determining a reference signal by applying to the noisy signal a processing suited to attenuating the speech components more strongly than the noise components of this noisy signal, said processing comprising: (a1) applying an adaptive linear prediction algorithm operating on a linear combination of the previous samples of the noisy signal, and (a2) determining said reference signal by a subtraction, with phase-shift compensation, between the original, unfiltered noisy signal and the signal delivered by the linear prediction algorithm;
b) determining an a priori probability of speech presence/absence from the respective energy levels, in the spectral domain, of the noisy signal and of the reference signal; and
c) using this a priori probability of speech absence to estimate a noise spectrum and to derive from the noisy signal a denoised estimate of the speech signal.

The reference signal may in particular be determined by applying, in step a2), a relation of the type:

$\mathrm{Ref}(k,l) = X(k,l) - X(k,l) \, \frac{|Y(k,l)|}{|X(k,l)|}$

where X(k,l) and Y(k,l) are the short-time Fourier transforms, for each spectrum segment k of each frame l, of the original noisy signal and of the signal delivered by the linear prediction algorithm, respectively.

The predictive algorithm is advantageously a recursive adaptive algorithm of the least mean squares (LMS) type.

Step b) advantageously comprises applying an algorithm for estimating the energy of the pseudo-stationary noise component in the reference signal and in the noisy signal, in particular an algorithm of the minima controlled recursive averaging (MCRA) type, as described in:

[5] I. Cohen and B. Berdugo, Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement, IEEE Signal Processing Letters, Vol. 9, No. 1, pp. 12-15, Jan. 2002.

Step c) advantageously comprises applying a variable-gain algorithm that depends on the probability of presence/absence of speech, in particular an algorithm of the optimally modified log-spectral amplitude (OM-LSA) gain type.

BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment of the invention will now be described with reference to the appended drawings, in which the same reference numerals designate identical or functionally similar elements from one figure to another.

Figure 1 is a schematic diagram illustrating the various operations performed by a denoising algorithm according to the method of the invention.
Figure 2 is a schematic diagram illustrating more particularly the adaptive predictive LMS algorithm.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The signal to be denoised is a sampled digital signal $x(n)$, where $n$ denotes the sample number ($n$ is therefore the time variable).

The captured signal $x(n)$ is a combination of a speech signal $s(n)$ and an added, uncorrelated noise $d(n)$:

$x(n) = s(n) + d(n)$

This noise $d(n)$ has two independent components, namely a transient component $d_{t}(n)$ and a pseudo-stationary component $d_{ps}(n)$:

$d(n) = d_{t}(n) + d_{ps}(n)$

As illustrated in Figure 1, the noisy signal $x(n)$ is applied as input to a predictive LMS algorithm represented by block 10, which includes the application of appropriate delays 12. The operation of this LMS algorithm is described below with reference to Figure 2.

The short-term Fourier transform is then calculated for the captured signal $x(n)$ (block 16) and for the signal $y(n)$ delivered by the predictive LMS algorithm (block 14). From these two transforms a reference signal is calculated (block 18), which is one of the input variables of an algorithm for calculating the probability of absence of speech (block 24). In parallel, the transform of the noisy signal $x(n)$, output by block 16, is also applied to the probability calculation algorithm.

Blocks 20 and 22 estimate the pseudo-stationary noise of the reference signal and of the transform of the noisy signal, and the result is also applied to the probability calculation algorithm.

The result of the speech-absence probability calculation, together with the transform of the noisy signal, is applied as input to an OM-LSA gain processing algorithm (block 26), whose result is subjected to an inverse Fourier transform (block 28) to give an estimate of the denoised speech.

The different phases of this processing will now be described in more detail.

The predictive LMS algorithm (block 10) is shown schematically in Figure 2.

Insofar as the signals involved are globally non-stationary but locally pseudo-stationary, an adaptive system can advantageously be used, which can take account of the variations in signal energy over time and converge towards the various local optima.

Essentially, if successive delays $\Delta$ are applied, the linear prediction $y(n)$ of the signal $x(n)$ is a linear combination of the earlier samples $\{x(n - \Delta - i + 1)\}_{1 \leq i \leq M}$:

$y(n) = \sum_{i=1}^{M} \omega_{i} \, x(n - \Delta - i + 1)$

which minimizes the mean square of the prediction error:

$\epsilon(n) = x(n) - y(n)$

The minimization consists in finding:

$\min_{\omega_{1}, \omega_{2}, \dots, \omega_{M}} E\left\{\left[x(n) - \sum_{i=1}^{M} \omega_{i} \, x(n - \Delta - i + 1)\right]^{2}\right\}$

To solve this problem, it is possible to use an LMS algorithm, which is known per se and is described, for example, in:

[6] B. Widrow, Adaptive Filters, Aspects of Network and System Theory, R. E. Kalman and N. De Claris (Eds.). New York: Holt, Rinehart and Winston, pp. 563-587, 1970, and
[7] B. Widrow et al., Adaptive Noise Cancelling: Principles and Applications, Proc. IEEE, Vol. 63, No. 12, pp. 1692-1716, Dec. 1975.

A recursive procedure for adapting the weights can be defined:

$\omega_{i}(n + 1) = \omega_{i}(n) + 2 \mu \, \epsilon(n) \, x(n - \Delta - i + 1)$

$\mu$ being a gain constant that makes it possible to adjust the speed and stability of the adaptation.
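The prediction and weight-update relations above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `lms_predict` and the default values of `num_taps` ($M$), `delay` ($\Delta$) and `mu` ($\mu$) are assumptions chosen for the example, not values given in this description, and samples before the start of the signal are assumed to be zero.

```python
def lms_predict(x, num_taps=8, delay=1, mu=0.01):
    """Predictive LMS sketch: returns (y, eps), the prediction and its error.

    num_taps (M), delay (Delta) and mu are illustrative values only.
    """
    w = [0.0] * num_taps              # weights omega_1 .. omega_M, start at zero
    y = [0.0] * len(x)
    eps = [0.0] * len(x)
    for n in range(len(x)):
        # Earlier samples x(n - Delta - i + 1) for i = 1..M
        # (index i0 = i - 1 below); assumed zero before the signal starts.
        past = [x[n - delay - i0] if n - delay - i0 >= 0 else 0.0
                for i0 in range(num_taps)]
        # y(n) = sum_i omega_i * x(n - Delta - i + 1)
        y[n] = sum(wi * xi for wi, xi in zip(w, past))
        # eps(n) = x(n) - y(n)
        eps[n] = x[n] - y[n]
        # omega_i(n + 1) = omega_i(n) + 2 * mu * eps(n) * x(n - Delta - i + 1)
        w = [wi + 2.0 * mu * eps[n] * xi for wi, xi in zip(w, past)]
    return y, eps
```

With a strongly correlated input such as a sinusoid, the error `eps` shrinks as the weights adapt, whereas for white noise it stays close to the input itself.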

General information on these aspects of the LMS algorithm can be found in:

[8] B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice-Hall Signal Processing Series, Alan V. Oppenheim Series Editor, 1985.

It can be shown that such an adaptive linear prediction makes it possible to discriminate effectively between noise and speech, because samples containing speech are predicted much better (smaller squared errors between the prediction and the raw signal) than those containing only noise.

More precisely, the respective signals $x(n)$ and $y(n)$ (noisy speech signal and linear prediction) are split into frames of identical length, and their short-term Fourier transforms (denoted $X$ and $Y$ respectively) are calculated for each frame. To avoid the effects of precision errors, the algorithm provides for a 50% overlap between consecutive frames, and the samples are multiplied by the coefficients of the Hanning window so that the sum of the even and odd frames corresponds to the original signal itself. For spectrum segment $k$ of an even frame $l$:

$X(k, l) = \sum_{p=1}^{R} h(p) \, x(Rl + p) \, e^{-j 2 \pi \frac{pk}{R}}$

and for spectrum segment $k$ of an odd frame $l$:

$X(k, l) = \sum_{p=1}^{R} h(p) \, x\left(\frac{R}{2} l + p\right) e^{-j 2 \pi \frac{pk}{R}}$

$h$ being the Hanning window.
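The framing described above can be illustrated with a short pure-Python sketch. It makes several assumptions that are not taken from this description: a periodic Hann window (so that two copies shifted by $R/2$ sum exactly to 1), zero-based sample indices, a single uniform hop of $R/2$ for all frames instead of separate even/odd formulas, and the hypothetical function names `hann` and `stft`. A direct DFT is used for readability rather than speed.

```python
import cmath
import math

def hann(R):
    """Periodic Hann window of even length R; h[p] + h[p + R/2] == 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * p / R) for p in range(R)]

def stft(x, R):
    """Short-term Fourier transform with 50% overlap (hop R/2).

    Returns a list of frames; each frame is a list of R complex bins X(k, l).
    """
    hop = R // 2
    h = hann(R)
    frames = []
    for start in range(0, len(x) - R + 1, hop):
        # Window the current frame.
        seg = [h[p] * x[start + p] for p in range(R)]
        # Direct DFT of the windowed frame (O(R^2), for illustration only).
        frames.append([
            sum(seg[p] * cmath.exp(-2j * math.pi * p * k / R) for p in range(R))
            for k in range(R)
        ])
    return frames
```

Because the overlapped windows sum to one, adding the inverse transforms of consecutive frames at their original positions gives back the input signal away from the edges.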

A first possibility is to define the reference signal by taking the Fourier transform of the prediction error:

$\hat{\epsilon}(k, l) = X(k, l) - Y(k, l)$

In practice, however, a certain phase shift is observed between $X$ and $Y$, due to imperfect convergence of the LMS algorithm, which prevents good discrimination between speech and noise. It is therefore preferred to adopt for the reference signal another definition that compensates for this phase shift, namely:

$Ref(k, l) = X(k, l) - X(k, l) \frac{|Y(k, l)|}{|X(k, l)|}$
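As a minimal sketch of this phase-compensated subtraction, the following Python function applies the relation bin by bin to one frame. The function name `reference_signal` and the handling of empty bins (where $|X(k, l)|$ is numerically zero the reference is set to zero) are assumptions made for the example.

```python
def reference_signal(X, Y, floor=1e-12):
    """Ref(k, l) = X(k, l) - X(k, l) * |Y(k, l)| / |X(k, l)| for one frame.

    X and Y are sequences of complex spectral bins of equal length.
    """
    ref = []
    for xk, yk in zip(X, Y):
        mag_x = abs(xk)
        if mag_x < floor:
            # Assumed convention: an empty bin yields a zero reference.
            ref.append(0j)
        else:
            # Only the magnitude of Y is used; the phase is that of X.
            ref.append(xk - xk * abs(yk) / mag_x)
    return ref
```

Since only the magnitude of $Y$ enters the expression, the result does not depend on any phase error of the predictor: a bin where $|Y| = |X|$ cancels completely even if $Y$ is rotated with respect to $X$.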

It is assumed that the spectral energy of the reference signal can be described in the form:

$E\{[Ref(k, l)]^{2}\} = E\{[S(k, l)]^{2}\} \, \alpha_{S}(k) + E\{[D_{t}(k, l)]^{2}\} \, \alpha_{D_{t}}(k) + E\{[D_{ps}(k, l)]^{2}\} \, \alpha_{D_{ps}}(k)$

where

$\alpha_{S}(k) < \alpha_{D_{t}}(k) < \alpha_{D_{ps}}(k)$

represent the attenuation, on the reference signal, of the three signals in each spectrum segment.

The next step consists in delivering an estimate $q(k, l)$ of the probability of absence of speech in the noisy signal:

$q(k, l) = \Pr\{H_{0}(k, l)\}$

$H_{0}(k, l)$ indicating the absence of speech (and $H_{1}(k, l)$ the presence of speech) in the $k$-th spectrum segment of the $l$-th frame.

The discrimination between transient noise and speech can be performed by a technique comparable to that of Cohen and Berdugo [5, cited above]. More precisely, the algorithm of the invention evaluates a ratio of the transient energies on the two paths, given by:

$\Omega(k, l) = \frac{SX(k, l) - MX(k, l)}{SRef(k, l) - MRef(k, l)}$

$S$ being a smoothed estimate of the instantaneous energy:

$SX(k, l) = SX(k, l - 1) + \sum_{i=-\omega}^{\omega} b(i) \, |X(k, l)|^{2}$

$b$ being a window in the time domain and $M$ being an estimator of the pseudo-stationary energy, which can be obtained, for example, by an MCRA (Minima Controlled Recursive Averaging) method of the same type as that described by Cohen and Berdugo [5, cited above] (several alternatives nevertheless exist in the literature).

In the presence of speech but in the absence of transient noise, only the speech term contributes to the transient energy on both paths, so this ratio is approximately:

$\Omega(k, l) = \frac{1}{\alpha_{S}(k)} = \Omega_{\max}(k)$

Conversely, in the absence of speech but in the presence of transient noise:

$\Omega(k, l) = \frac{1}{\alpha_{D_{t}}(k)} = \Omega_{\min}(k)$

If it is assumed that in general:

$\Omega_{\min}(k) \leq \Omega(k, l) \leq \Omega_{\max}(k)$

a procedure for estimating $q(k, l)$ is given by the following metalanguage algorithm:

For each frame $l$ and for each spectrum segment $k$:

(i) Calculate $SX(k, l)$, $MX(k, l)$, $SRef(k, l)$ and $MRef(k, l)$. Go to (ii).
(ii) If $SX(k, l) > L_{X} \, MX(k, l)$ (transient detected on the noisy speech path), go to (iii); otherwise $q(k, l) = 1$.
(iii) If $SRef(k, l) > L_{Ref} \, MRef(k, l)$ (transient detected on the reference path), go to (iv); otherwise $q(k, l) = 0$.
(iv) Calculate $\Omega(k, l)$. Go to (v).
(v) Calculate: $q(k, l) = \max\left(\min\left(\frac{\Omega_{\max}(k) - \Omega(k, l)}{\Omega_{\max}(k) - \Omega_{\min}(k)}, 1\right), 0\right)$
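Steps (ii) to (v) of the procedure above translate almost directly into code. The sketch below handles a single spectrum segment of a single frame and takes the four energy estimates of step (i) as inputs. The function name `speech_absence_prob` is hypothetical, and it is assumed that `L_Ref` is at least 1 and that the energies are non-negative, so that the denominator of $\Omega(k, l)$ is strictly positive when step (iv) is reached.

```python
def speech_absence_prob(SX, MX, SRef, MRef, L_X, L_Ref, omega_min, omega_max):
    """A priori speech absence probability q(k, l) for one bin of one frame."""
    # (ii) No transient on the noisy speech path: speech is absent.
    if SX <= L_X * MX:
        return 1.0
    # (iii) Transient on the noisy path but not on the reference path:
    # the transient was attenuated like speech, so speech is present.
    if SRef <= L_Ref * MRef:
        return 0.0
    # (iv) Ratio of the transient energies on the two paths.
    omega = (SX - MX) / (SRef - MRef)
    # (v) Linear mapping between omega_min and omega_max, clipped to [0, 1].
    q = (omega_max - omega) / (omega_max - omega_min)
    return max(min(q, 1.0), 0.0)
```

For example, with `omega_min=1` and `omega_max=5`, a measured ratio of 3 lies halfway between the two limits and gives `q = 0.5`.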

The constants $L_{X}$ and $L_{Ref}$ are transient detection thresholds. $\Omega_{\min}(k)$ and $\Omega_{\max}(k)$ are the lower and upper limits for each spectrum segment. These various parameters are chosen so as to correspond to typical, realistic situations.

The next step (corresponding to block 26 of Figure 1) consists in performing the denoising proper (enhancement of the speech component). The estimator just described is applied to the statistical model described by Ephraim and Malah [2, cited above], which assumes that the speech and the noise in each spectrum segment are independent Gaussian processes of respective variances $\lambda_{x}(k, l)$ and $\lambda_{d}(k, l)$.

This step may advantageously implement the OM-LSA (Optimally Modified Log-Spectral Amplitude) gain algorithm described by Cohen and Berdugo [3, cited above]. The a priori signal-to-noise ratio is defined by:

$\xi(k, l) = \frac{\lambda_{x}(k, l)}{\lambda_{d}(k, l)}$

The a posteriori signal-to-noise ratio is defined by:

$\gamma(k, l) = \frac{|X(k, l)|^{2}}{\lambda_{d}(k, l)}$

The conditional probability of signal presence is:

$p(k, l) = \Pr(H_{1}(k, l) \mid X(k, l))$

With the Gaussian assumption and the parameters above, this gives:

$p(k, l) = \left\{1 + \frac{q(k, l)}{1 - q(k, l)} (1 + \xi(k, l)) \exp(-\upsilon(k, l))\right\}^{-1}$

with:

$\upsilon(k, l) = \frac{\gamma(k, l) \, \xi(k, l)}{1 + \xi(k, l)}$
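The conditional presence probability can be evaluated directly from $q$, $\xi$ and $\gamma$. In the short sketch below, the function name `speech_presence_prob` is hypothetical, and the case $q = 1$ is handled separately (the formula then tends to $p = 0$) to avoid a division by zero.

```python
import math

def speech_presence_prob(q, xi, gamma):
    """Conditional speech presence probability p(k, l) for one bin.

    q: a priori speech absence probability, xi: a priori SNR,
    gamma: a posteriori SNR.
    """
    if q >= 1.0:
        # Limit of the formula when speech is certainly absent a priori.
        return 0.0
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * math.exp(-v))
```

For fixed $q$ and $\xi$, the returned value increases with the a posteriori ratio $\gamma$: the more the observed energy exceeds the estimated noise, the more likely speech is present.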

The optimal estimate of the denoised speech $\hat{S}(k, l)$ is given by:

$\hat{S}(k, l) = G_{H_{1}}(k, l)^{p(k, l)} \, G_{\min}^{1 - p(k, l)} \, X(k, l)$

$G_{H_{1}}$ being the gain under the hypothesis that speech is present, which is defined by:

$G_{H_{1}}(k, l) = \frac{\xi(k, l)}{1 + \xi(k, l)} \exp\left(\frac{1}{2} \int_{\upsilon(k, l)}^{\infty} \frac{e^{-t}}{t} \, dt\right)$
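The integral in $G_{H_{1}}$ is the exponential integral $E_{1}(\upsilon)$. The sketch below evaluates it with the standard library only, using the classical power series for small arguments and the classical continued fraction otherwise, and then forms the combined gain $G_{H_{1}}^{p} \, G_{\min}^{1 - p}$. All names (`exp1`, `gain_h1`, `omlsa_gain`), the switch-over point at 1, the number of terms, the small floor applied to $\upsilon$ and the default `g_min` are assumptions made for the example.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp1(v, terms=60):
    """E1(v) = integral from v to infinity of exp(-t) / t dt, for v > 0."""
    if v <= 0.0:
        raise ValueError("exp1 requires v > 0")
    if v <= 1.0:
        # Power series: E1(v) = -gamma - ln(v) - sum_{k>=1} (-v)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, terms + 1):
            term *= -v / k            # term is now (-v)^k / k!
            total += term / k
        return -EULER_GAMMA - math.log(v) - total
    # Continued fraction, evaluated from the innermost level outwards:
    # E1(v) = exp(-v) / (v + 1/(1 + 1/(v + 2/(1 + 2/(v + ...)))))
    frac = 0.0
    for k in range(terms, 0, -1):
        frac = k / (1.0 + k / (v + frac))
    return math.exp(-v) / (v + frac)

def gain_h1(xi, gamma):
    """Gain under the speech-presence hypothesis, G_H1(k, l)."""
    v = max(gamma * xi / (1.0 + xi), 1e-12)   # assumed floor to keep v > 0
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))

def omlsa_gain(xi, gamma, p, g_min=0.1):
    """Combined gain G_H1^p * G_min^(1 - p); multiply X(k, l) by it."""
    return gain_h1(xi, gamma) ** p * g_min ** (1.0 - p)
```

When $p = 0$ the combined gain reduces to `g_min`, and when $p = 1$ it reduces to $G_{H_{1}}$; for large $\upsilon$ the exponential term tends to 1 and $G_{H_{1}}$ approaches $\xi / (1 + \xi)$.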

The gain $G_{\min}$ under the hypothesis of absence of speech is a lower limit for the noise reduction, in order to limit speech distortion.

The classical formula for estimating the a priori signal-to-noise ratio is:

$\hat{\xi}(k, l) = a \, G_{H_{1}}^{2}(k, l - 1) \, \gamma(k, l - 1) + (1 - a) \max(\gamma(k, l) - 1, 0)$
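This recursion can be written as a one-line function. In the sketch below the name `a_priori_snr` and the default weighting `a=0.92` are illustrative assumptions; the description does not give a value for $a$.

```python
def a_priori_snr(g_prev, gamma_prev, gamma, a=0.92):
    """Estimate xi(k, l) from the previous frame's gain and a posteriori SNR.

    g_prev = G_H1(k, l - 1), gamma_prev = gamma(k, l - 1), gamma = gamma(k, l).
    """
    return a * g_prev ** 2 * gamma_prev + (1.0 - a) * max(gamma - 1.0, 0.0)
```

The first term carries over the speech energy estimated in the previous frame, while the second reacts to the current frame and is clipped at zero whenever the observed energy is below the noise estimate.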

The estimate of the noise energy is given by:

$\hat{\lambda}_{d}(k, l + 1) = \tilde{a}_{d}(k, l) \, \hat{\lambda}_{d}(k, l) + \beta \, (1 - \tilde{a}_{d}(k, l)) \, |X(k, l)|^{2}$

The smoothing parameter $\tilde{a}_{d}$ varies between a lower limit $a_{d}$ and 1, as a function of the conditional presence probability:

$\tilde{a}_{d}(k, l) = a_{d} + (1 - a_{d}) \, p(k, l)$

$\beta$ being an overestimation factor that compensates for the bias in the absence of signal.
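The noise update and its probability-dependent smoothing parameter can be combined in one small function. This is a sketch only: the name `update_noise` and the default values of `a_d` and `beta` are assumptions chosen for illustration and are not specified in this description.

```python
def update_noise(lambda_d, X_mag2, p, a_d=0.85, beta=1.47):
    """Return the noise energy estimate for frame l + 1 in one bin.

    lambda_d: current estimate, X_mag2: |X(k, l)|^2,
    p: conditional speech presence probability p(k, l).
    """
    # Smoothing parameter between a_d (speech absent) and 1 (speech present).
    a_tilde = a_d + (1.0 - a_d) * p
    return a_tilde * lambda_d + beta * (1.0 - a_tilde) * X_mag2
```

When speech is judged certainly present ($p = 1$) the estimate is left unchanged, so speech energy does not leak into the noise spectrum; when $p = 0$ the estimate tracks the observed energy at the fastest rate allowed by `a_d`.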

The signal obtained at the end of this processing is subjected to an inverse Fourier transform (block 28) to give the final estimate of the denoised speech.

The algorithm of the present invention proves particularly effective in noisy environments disturbed both by mechanical noise, vibrations, etc. and by musical noise, situations that are characteristic of a car interior. Spectrograms show that the noise attenuation is not only effective but is also achieved without notable distortion of the speech after denoising.

Claims

1. A method for processing an audio signal, for the denoising of an original noisy signal comprising a speech component combined with a noise component, this noise component itself comprising a transient noise component and a pseudo-stationary noise component, characterized in that this method is a method of analyzing the temporal coherence of the sampled noisy signal comprising the steps of:
a) determination of a reference signal by application to the noisy signal of a processing (10, 18) suitable for more significantly attenuating the speech components than the noise components of this noisy signal, the said processing comprising:
a1) the application of an adaptive linear prediction algorithm operating on a linear combination of the earlier samples of the noisy signal, and
a2) the determination of the said reference signal by a subtraction, with phase-shift compensation, between the original, non-prefiltered noisy signal and the signal delivered by the linear prediction algorithm;
b) determination (24) of an a priori probability of presence/absence of speech on the basis of the respective energy levels in the spectral domain of the noisy signal and of the reference signal; and
c) use of this a priori probability of absence of speech to estimate a noise spectrum and to derive (26) from the noisy signal a denoised estimate of the speech signal.

2. The method of Claim 1, in which the said reference signal is determined by application in step a2) of a relation of the type:

$Ref(k, l) = X(k, l) - X(k, l) \left|\frac{Y(k, l)}{X(k, l)}\right|$

where $X(k, l)$ and $Y(k, l)$ are the short-term Fourier transforms of each spectrum segment $k$ of each frame $l$, respectively of the original noisy signal and of the signal delivered by the linear prediction algorithm.

3. The method of Claim 1, in which the linear prediction algorithm (10) is an algorithm of least mean squares, LMS, type.

4. The method of Claim 1, in which the linear prediction algorithm (10) is a recursive adaptive algorithm.

5. The method of Claim 1, in which step b) comprises the application of an algorithm for estimating the energy of the pseudo-stationary noise component in the reference signal and in the noisy signal.

6. The method of Claim 5, in which the algorithm for estimating the energy of the pseudo-stationary noise component is an algorithm of minima controlled recursive averaging, MCRA, type.

7. The method of Claim 1, in which step c) comprises the application of a variable gain algorithm dependent on the probability of presence/absence of speech.

8. The method of Claim 7, in which the variable gain algorithm is an algorithm of optimally modified log-spectral amplitude, OM-LSA, gain type.