EP2772916B1

EP2772916B1 - Method for suppressing noise in an audio signal by an algorithm with variable spectral gain with dynamically adaptive strength

Info

Publication number: EP2772916B1
Application number: EP14155968.2A
Authority: EP
Inventors: Alexandre Briot
Original assignee: Parrot Automotive SA
Current assignee: Faurecia Clarion Electronics Europe SAS
Priority date: 2013-02-28
Filing date: 2014-02-20
Publication date: 2015-12-02
Anticipated expiration: 2034-02-20
Also published as: FR3002679B1; US20140244245A1; CN104021798A; CN104021798B; EP2772916A1; FR3002679A1

Description

L'invention concerne le traitement de la parole en milieu bruité.The invention relates to the treatment of speech in a noisy environment.

Elle concerne notamment le traitement des signaux de parole captés par des dispositifs de téléphonie de type "mains libres" destinés à être utilisés dans un environnement bruité.It relates in particular to the processing of speech signals picked up by "hands-free" telephony devices intended to be used in a noisy environment.

Ces appareils comportent un ou plusieurs microphones captant non seulement la voix de l'utilisateur, mais également le bruit environnant, bruit qui constitue un élément perturbateur pouvant aller dans certains cas jusqu'à rendre inintelligibles les paroles du locuteur. Il en est de même si l'on veut mettre en oeuvre des techniques de reconnaissance vocale, car il est très difficile d'opérer une reconnaissance de forme sur des mots noyés dans un niveau de bruit élevé.These devices include one or more microphones sensing not only the voice of the user, but also the surrounding noise, noise that is a disruptive element that can go in some cases to make unintelligible the words of the speaker. It is the same if one wants to implement speech recognition techniques, because it is very difficult to perform a form recognition on words embedded in a high noise level.

Cette difficulté liée aux bruits environnants est particulièrement contraignante dans le cas des dispositifs "mains libres" pour véhicules automobiles, qu'il s'agisse d'équipements incorporés au véhicule ou bien d'accessoires en forme de boitier amovible intégrant tous les composants et fonctions de traitement du signal pour la communication téléphonique.This difficulty related to surrounding noise is particularly restrictive in the case of devices "hands free" for motor vehicles, whether in-vehicle equipment or accessories in the form of removable housing incorporating all components and functions signal processing for telephone communication.

En effet, la distance importante entre le micro (placé au niveau de la planche de bord ou dans un angle supérieur du pavillon de l'habitacle) et le locuteur (dont l'éloignement est contraint par la position de conduite) entraine la captation d'un niveau de parole relativement faible par rapport au bruit ambiant, qui rend difficile l'extraction du signal utile noyé dans le bruit. En plus de cette composante stationnaire permanente de bruit de roulement, le milieu très bruité typique de l'environnement automobile présente des caractéristiques spectrales non stationnaires, c'est-à-dire qui évoluent de manière imprévisible en fonction des conditions de conduite : passage sur des chaussées déformées ou pavées, autoradio en fonctionnement, etc.Indeed, the large distance between the microphone (placed at the dashboard or in an upper corner of the roof of the cockpit) and the speaker (whose distance is constrained by the driving position) leads to the capture of a relatively low level of speech compared to the ambient noise, which makes it difficult to extract the useful signal embedded in the noise. In addition to this permanent stationary rolling noise component, the very noisy environment typical of the automotive environment has non-stationary spectral characteristics, that is to say, which evolve unpredictably according to the driving conditions: passage over deformed or paved roads, car radio in operation, etc.

Des difficultés du même genre se présentent dans le cas où le dispositif est un casque audio de type micro/casque combiné utilisé pour des fonctions de communication telles que des fonctions de téléphonie "mains libres", en complément de l'écoute d'une source audio (musique par exemple) provenant d'un appareil sur lequel est branché le casque.Similar difficulties arise in the case where the device is a headset type microphone / headset combined used for communication functions such as "hands-free" telephony functions, in addition to listening to a source audio (music for example) from a device to which the headphones are connected.

Dans ce cas, il s'agit d'assurer une intelligibilité suffisante du signal capté par le micro, c'est-à-dire du signal de parole du locuteur proche (le porteur du casque). Or, le casque peut être utilisé dans un environnement bruyant (métro, rue passante, train, etc.), de sorte que le micro captera non seulement la parole du porteur du casque, mais également les bruits parasites environnants. Le porteur est certes protégé de ce bruit par le casque, notamment s'il s'agit d'un modèle à écouteurs fermés isolant l'oreille de l'extérieur, et encore plus si le casque est pourvu d'un "contrôle actif de bruit". En revanche, le locuteur distant (celui se trouvant à l'autre bout du canal de communication) souffrira des bruits parasites captés par le micro et venant se superposer et interférer avec le signal de parole du locuteur proche (le porteur du casque). En particulier, certains formants de la parole essentiels à la compréhension de la voix sont souvent noyés dans des composantes de bruit couramment rencontrées dans les environnements habituels.In this case, it is a question of ensuring sufficient intelligibility of the signal picked up by the microphone, that is to say the speech signal of the close speaker (the helmet wearer). Now, the headset can be used in a noisy environment (Metro, busy street, train, etc.), so that the microphone will not only pick up the words of the wearer of the helmet, but also the surrounding noise. The wearer is certainly protected from this noise by the helmet, especially if it is a model with closed earphones isolating the ear from the outside, and even more if the headset is provided with an "active control of noise". On the other hand, the distant speaker (the one at the other end of the communication channel) will suffer from the noise picked up by the microphone and being superimposed and interfere with the speech signal of the near speaker (the helmet wearer). In particular, certain speech formers essential to the understanding of the voice are often embedded in noise components commonly encountered in the usual environments.

L'invention concerne plus particulièrement les techniques de débruitage sélectif monocanal, c'est-à-dire opérant sur un unique signal (par opposition aux techniques mettant en oeuvre plusieurs micros dont les signaux sont combinés de façon judicieuse et font l'objet d'une analyse de cohérence spatiale ou spectrale, par exemple par des techniques de type beamforming ou autres). Cependant, elle s'appliquera avec la même pertinence à un signal recomposé à partir de plusieurs micros par une technique de beamforming, dans la mesure où l'invention présentée ici s'applique à un signal scalaire.The invention relates more particularly to single-channel selective denoising techniques, that is to say operating on a single signal (as opposed to techniques using several microphones whose signals are combined judiciously and are subject to a spatial or spectral coherence analysis, for example by beamforming or other techniques). However, it will apply with the same relevance to a signal recomposed from several microphones by a beamforming technique , insofar as the invention presented here applies to a scalar signal.

Dans le cas présent, il s'agit d'opérer le débruitage sélectif d'un signal audio bruité, généralement obtenu après numérisation du signal recueilli par un micro unique de l'équipement de téléphonie.In this case, it is to operate the selective denoising of a noisy audio signal, generally obtained after scanning the signal collected by a single microphone of the telephony equipment.

L'invention vise plus particulièrement un perfectionnement apporté aux algorithmes de réduction de bruit reposant sur un traitement du signal dans le domaine fréquentiel (donc après application d'une transformation de Fourier FFT) consistant à appliquer un gain spectral calculé en fonction de plusieurs estimateurs de probabilité de présence de parole.The invention aims more particularly at an improvement made to the noise reduction algorithms based on signal processing in the frequency domain (thus after application of an FFT Fourier transformation) of applying a calculated spectral gain according to several estimators of probability of presence of speech.

Plus précisément, le signal y issu du microphone est découpé en trames de longueur fixe, chevauchantes ou non, et chaque trame d'indice k est transposée dans le domaine fréquentiel par FFT. Le signal fréquentiel résultant Y(k,l), lui aussi discret, est alors décrit par un ensemble de "bins" fréquentiel (bandes de fréquences) d'indice l, typiquement 128 bins de fréquences positives.More specifically, the signal y from the microphone is cut into frames of fixed length, overlapping or not, and each frame of index k is transposed into the frequency domain by FFT. The resulting frequency signal Y ( k, l ) , which is also discrete, is then described by a set of frequency "bins" (frequency bands) of index l , typically 128 bins of positive frequencies.

Pour chaque trame de signal, un certain nombre d'estimateurs sont mis à jour pour déterminer une probabilité fréquentielle de présence de parole p(k,l). Si la probabilité est grande, le signal sera considéré comme du signal utile (parole) et donc préservé avec un gain spectral G(k,l) = 1 pour le bin considéré. Dans le cas contraire, si la probabilité est faible le signal sera assimilé à du bruit et donc réduit, voire supprimé par application d'un gain spectral d'atténuation très inférieur à 1.For each signal frame, a number of estimators are updated to determine a frequency probability of speech presence p ( k, l ) . If the probability is large, the signal will be considered as useful signal (speech) and thus preserved with a spectral gain G ( k, l ) = 1 for the considered bin. In the opposite case, if the probability is low, the signal will be assimilated to noise and thus reduced or even eliminated by applying a spectral attenuation gain that is much smaller than 1.

En d'autres termes, le principe de cet algorithme consiste à calculer et appliquer au signal utile un "masque fréquentiel" qui conserve l'information utile du signal de parole et élimine le signal parasite de bruit :

Cette technique peut être notamment implémentée par un algorithme de type OM-LSA (Optimally Modified - Log Spectral Amplitude) telle que ceux décrits par :
1. [1] I. Cohen et B. Berdugo, "Speech Enhancement for Non-Stationary Noise Environments", Signal Processing, Vol. 81, No 11, pp. 2403-2418, Nov. 2001 ; et
2. [2] I. Cohen, "Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator", IEEE Signal Processing Letters, Vol. 9, No 4, pp. 113-116, Apr. 2002 .

In other words, the principle of this algorithm consists in calculating and applying to the useful signal a "frequency mask" which preserves the useful information of the speech signal and eliminates the noise noise signal:

This technique can be implemented in particular by an OM-LSA type algorithm ( Optimally Modified - Log Spectral Amplitude) such as those described by:
1. [1] I. Cohen and B. Berdugo, "Speech Enhancement for Non-Stationary Noise Environments," Signal Processing, Vol. 81, No. 11, pp. 2403-2418, Nov. 2001 ; and
2. [2] I. Cohen, "Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator," IEEE Signal Processing Letters, Vol. 9, No. 4, pp. 113-116, Apr. 2002 .

Le US 7 454 010 B1 décrit également un algorithme comparable prenant en compte, pour le calcul des gains spectraux, une information de présence ou non de la voix dans un segment temporel courant.The US 7,454,010 B1 also describes a comparable algorithm taking into account, for the calculation of the spectral gains, information of presence or absence of the voice in a current time segment.

On pourra également se référer au WO 2007/099222 A1 (Parrot ), qui décrit une technique de débruitage mettant en oeuvre un calcul de probabilité de présence de parole.We can also refer to WO 2007/099222 A1 (Parrot ), which describes a denoising technique implementing a probability calculation of presence of speech.

L'efficacité d'une telle technique réside bien entendu dans le modèle de l'estimateur de probabilité de présence de parole qui doit discriminer parole et bruit.The effectiveness of such a technique lies, of course, in the model of the speech probability estimator which must discriminate between speech and noise.

Dans la pratique, l'implémentation d'un tel algorithme se heurte à un certain nombre de défauts, dont les deux principaux sont le "bruit musical" et l'apparition d'une "voix robotisée".In practice, the implementation of such an algorithm encounters a number of defects, the two main ones being the "musical noise" and the appearance of a "robotic voice".

Le "bruit musical" se caractérise par une nappe de bruit de fond résiduel non uniforme, privilégiant certaines fréquences spécifiques. La tonalité du bruit n'est alors plus du tout naturelle, ce qui rend l'écoute perturbante. Ce phénomène résulte de ce que le traitement fréquentiel de débruitage est opéré sans dépendance entre fréquences voisines lors de la discrimination fréquentielle entre parole et bruit, car le traitement n'intègre pas de mécanisme pour prévenir deux gains spectraux voisins très différents. Or, dans les périodes de bruit seul, il faudrait idéalement un gain d'atténuation uniforme pour préserver la tonalité du bruit ; mais en pratique, si les gains spectraux ne sont pas homogènes, le bruit résiduel devient "musical" avec l'apparition de notes fréquentielles aux fréquences moins atténuées, correspondant à des bins faussement détectés comme contenant du signal utile. On notera que ce phénomène est d'autant plus marqué que l'on autorise l'application de gains d'atténuation importants."Music noise" is characterized by a non-uniform residual background noise layer, favoring certain specific frequencies. The tone of the noise is then no longer natural, which makes listening disturbing. This phenomenon results from the fact that the frequency treatment of denoising is operated without dependence between neighboring frequencies during frequency discrimination between speech and noise, because the treatment does not include a mechanism to prevent two very different neighboring spectral gains. However, in periods of noise alone, ideally a uniform attenuation gain is needed to preserve the tone of the noise; but in practice, if the spectral gains are not homogeneous, the residual noise becomes "musical" with the appearance of frequency notes at less attenuated frequencies, corresponding to bins falsely detected as containing useful signal. It should be noted that this phenomenon is all the more marked as it allows the application of significant attenuation gains.

Le phénomène de "voix robotisée" ou "voix métallique", quant à lui, se présente lorsque l'on choisit d'opérer une réduction de bruit très agressive, avec des gains spectraux d'atténuation importants. En présence de parole, des fréquences correspondant à de la parole mais qui sont faussement détectées comme étant du bruit seront fortement atténuées, rendant la voix moins naturelle, voire totalement artificielle ("robotisation" de la voix).The phenomenon of "robotic voice" or "metallic voice", meanwhile, occurs when one chooses to operate a very aggressive noise reduction, with significant spectral attenuation gains. In the presence of speech, frequencies corresponding to speech but which are falsely detected as noise will be strongly attenuated, making the voice less natural or even totally artificial ("robotization" of the voice).

Le paramétrage d'un tel algorithme consiste donc à trouver un compromis sur l'agressivité du débruitage, de manière à enlever un maximum de bruit sans que les effets indésirables de l'application de gains spectraux d'atténuation trop importants ne deviennent trop perceptibles. Ce dernier critère se révèle toutefois extrêmement subjectif, et sur un groupe témoin d'utilisateurs relativement large il s'avère difficile de trouver un réglage de compromis qui puisse faire l'unanimité.The parameterization of such an algorithm therefore consists in finding a compromise on the aggressiveness of the denoising, so as to remove a maximum of noise without the undesirable effects of the application of too large attenuation spectral gains becoming too perceptible. This last criterion proves however extremely subjective, and on a comparatively large control group of users it is difficult to find a compromise setting that can be unanimous.

Pour minimiser ces défauts, inhérents à une technique de débruitage par application d'un gain spectral, le modèle "OM-LSA" prévoit de fixer une borne inférieure G_min pour le gain d'atténuation (exprimé suivant une échelle logarithmique, ce gain d'atténuation correspond donc dans la suite de ce document à une valeur négative) appliqué aux zones identifiées comme du bruit, de manière à s'interdire de trop débruiter pour limiter l'apparition des défauts évoqués plus haut. Cette solution n'est cependant pas optimale : certes, elle contribue à faire disparaitre les effets indésirables d'une réduction de bruit excessive, mais dans le même temps elle limite les performances du débruitage.To minimize these defects, inherent to a denoising technique by applying a spectral gain, the "OM-LSA" model provides for setting a lower bound G _min for the attenuation gain (expressed on a logarithmic scale, this gain of attenuation therefore corresponds in the remainder of this document to a negative value) applied to the zones identified as noise, so as to prevent too much denoising to limit the appearance of the defects mentioned above. This solution is however not optimal: it certainly helps to eliminate the undesirable effects of excessive noise reduction, but at the same time it limits the performance of denoising.

Le problème de l'invention est de pallier cette limitation, en rendant plus performant le système de réduction de bruit par application d'un gain spectral (typiquement selon un modèle OM-LSA), tout en respectant les contraintes évoquées plus haut, à savoir réduire efficacement le bruit sans altérer l'aspect naturel de la parole (en présence de parole) ni celui du bruit (en présence de bruit). En d'autres termes, il convient de rendre imperceptibles par le locuteur distant les effets indésirables du traitement algorithmique, tout en atténuant le bruit de manière importante.The problem of the invention is to overcome this limitation, by making the noise reduction system more efficient by applying a spectral gain (typically according to an OM-LSA model), while respecting the constraints mentioned above, namely effectively reduce noise without altering the natural appearance of speech (in the presence of speech) or of noise (in the presence of noise). In other words, the undesirable effects of the algorithmic processing must be made imperceptible by the remote speaker, while at the same time attenuating the noise significantly.

L'idée de base de l'invention consiste à moduler le calcul du gain spectral G_OMLSA - calculé dans le domaine fréquentiel pour chaque bin - par un indicateur global, observé au niveau de la trame temporelle et non plus au niveau d'un unique bin de fréquence.The basic idea of the invention is to modulate the calculation of the spectral gain G _OMLSA - calculated in the frequency domain for each bin - by a global indicator observed at the time frame and no longer at a single level. frequency bin.

Cette modulation sera opérée par une transformation directe de la borne inférieure G_min du gain d'atténuation - borne qui est un scalaire communément désigné "dureté de débruitage" - en une fonction temporelle dont la valeur sera déterminée en fonction d'un descripteur temporel (ou "variable globale") reflété par l'état des divers estimateurs de l'algorithme. Ces derniers seront choisis en fonction de leur pertinence pour décrire des situations connues pour lesquelles on sait que le choix de la dureté de débruitage G_min peut être optimisé.This modulation will be performed by a direct transformation of the lower limit G _min of the attenuation gain - a terminal which is a scalar commonly referred to as "denoising hardness" - into a time function whose value will be determined according to a temporal descriptor ( or "global variable") reflected by the state of the various estimators of the algorithm. These will be chosen according to their relevance to describe known situations for which it is known that the choice of denoising hardness G _min can be optimized.

Par la suite et en fonction des cas de figure, la modulation temporelle appliquée à ce gain d'atténuation G_min logarithmique pourra correspondre soit à un incrément soit à un décrément : un décrément sera associé à une dureté de réduction de bruit plus grande (gain logarithmique plus grand en valeur absolue), inversement un incrément de ce gain logarithmique négatif sera associé à une valeur absolue plus petite donc une dureté de réduction de bruit plus faible.Subsequently and depending on the case, the temporal modulation applied to this logarithmic G _min attenuation gain may correspond to either an increment or a decrement: a decrement will be associated with a greater noise reduction hardness (gain logarithmic greater in absolute value), conversely an increment of this negative logarithmic gain will be associated with a smaller absolute value, hence a lower noise reduction hardness.

En effet, on constate qu'une observation à l'échelle de la trame peut bien souvent permettre de corriger certains défauts de l'algorithme, notamment dans des zones très bruitées où il peut parfois faussement détecter une fréquence de bruit comme étant une fréquence de parole : ainsi, si une trame de bruit seul est détectée (au niveau de la trame), on pourra débruiter de façon plus agressive sans pour autant introduire de bruit musical, grâce à un débruitage plus homogène.Indeed, it can be seen that a frame-scale observation can often correct certain defects in the algorithm, particularly in very noisy areas where it can sometimes falsely detect a noise frequency as a frequency of speech: thus, if a frame of only noise is detected (at the level of the frame), one will be able to denoise in a more aggressive way without introducing musical noise, thanks to a more homogeneous denoising.

Inversement, sur une période de parole bruitée, on pourra s'autoriser à moins débruiter afin de parfaitement préserver la voix tout en veillant à ce que la variation d'énergie du bruit de fond résiduel ne soit pas perceptible. On dispose ainsi d'un double levier (dureté et homogénéité) pour moduler l'importance du débruitage selon le cas considéré - phase de bruit seul ou bien phase de parole -, la discrimination entre l'un ou l'autre cas résultant d'une observation à l'échelle de la trame temporelle :

dans le premier mode de réalisation, l'optimisation consistera à moduler dans le sens adéquat la valeur de la dureté de débruitage G_min pour mieux réduire le bruit en phase de bruit seul, et mieux préserver la voix en phase de parole ;

Conversely, over a period of noisy speech, we can allow less denoising in order to perfectly preserve the voice while ensuring that the energy variation of the residual background noise is not noticeable. We thus have a double lever (hardness and homogeneity) to modulate the importance of denoising according to the case considered - noise phase alone or speech phase -, the discrimination between one or the other case resulting from an observation on the scale of the time frame:

in the first embodiment, the optimization will consist in modulating in the appropriate direction the value of the denoising hardness G _min to better reduce noise in the noise phase alone, and better preserve the voice in the speech phase;

Plus précisément, l'invention propose un procédé de débruitage d'un signal audio par application d'un algorithme à gain spectral variable fonction d'une probabilité de présence de parole, comportant de manière en elle-même connue les étapes successives suivantes :

a) génération de trames temporelles successives du signal audio bruité numérisé ;
b) application d'une transformation de Fourier aux trames générées à l'étape a), de manière à produire pour chaque trame temporelle de signal un spectre de signal avec une pluralité de bandes de fréquences prédéterminées ;
c) dans le domaine fréquentiel :
- c1) estimation, pour chaque bande de fréquences de chaque trame temporelle courante, d'une probabilité de présence de parole ;
- c3) calcul d'un gain spectral, propre à chaque bande de fréquence de chaque trame temporelle courante, en fonction de : i) une estimation de l'énergie du bruit dans chaque bande de fréquences, ii) la probabilité de présence de parole estimée à l'étape c1), et iii) une valeur scalaire de gain minimal représentative d'un paramètre de dureté du débruitage ;
- c4) réduction sélective de bruit par application à chaque bande de fréquences du gain calculé à l'étape c3) ;
d) application d'une transformation de Fourier inverse au spectre de signal constitué des bandes de fréquences produites à l'étape c4), de manière à délivrer pour chaque spectre une trame temporelle de signal débruité ; et
e) reconstitution d'un signal audio débruité à partir des trames temporelles délivrées à l'étape d).

More specifically, the invention proposes a method of denoise of an audio signal by applying a variable spectral gain algorithm according to a probability of presence of speech, comprising in a manner known per se the following successive steps:

a) generating successive time frames of the digitized noisy audio signal;
b) applying a Fourier transform to the frames generated in step a), so as to produce for each signal time frame a signal spectrum with a plurality of predetermined frequency bands;
c) in the frequency domain:
- c1) estimating, for each frequency band of each current time frame, a probability of presence of speech;
- c3) calculating a spectral gain, specific to each frequency band of each current time frame, as a function of: i) an estimate of the noise energy in each frequency band, ii) the probability of presence of estimated speech in step c1), and iii) a minimum gain scalar value representative of a denoising hardness parameter;
- c4) selective reduction of noise by applying to each frequency band the gain calculated in step c3);
d) applying an inverse Fourier transform to the signal spectrum consisting of the frequency bands produced in step c4), providing for each spectrum a time frame of denoised signal; and
e) reconstituting an audio signal denoised from the time frames delivered in step d).

De façon caractéristique de l'invention :

ladite valeur scalaire de gain minimal est une valeur modulable de manière dynamique à chaque trame temporelle successive ; et
le procédé comporte en outre, préalablement à l'étape c3) de calcul du gain spectral, une étape de :
- c2) calcul, pour la trame temporelle courante, de ladite valeur modulable en fonction d'une variable globale observée au niveau de la trame temporelle courante pour toutes les bandes de fréquences ; et
ledit calcul de l'étape c2) comprend l'application, pour la trame temporelle courante, d'un incrément/décrément apporté à une valeur paramétrée nominale dudit gain minimal.

Characteristically, the invention

said minimum gain scalar value is a dynamically scalable value at each successive time frame; and
the method further comprises, prior to step c3) of calculating the spectral gain, a step of:
- c2) calculating, for the current time frame, said scalable value as a function of a global variable observed at the current time frame for all the frequency bands; and
said calculation of step c2) comprises the application, for the current time frame, of an increment / decrement brought to a nominal parameterized value of said minimum gain.

Dans une première implémentation de l'invention, la variable globale est un rapport signal sur bruit de la trame temporelle courante, évalué dans le domaine temporel.In a first implementation of the invention, the global variable is a signal-to-noise ratio of the current time frame evaluated in the time domain.

La valeur scalaire de gain minimal peut notamment être calculée à l'étape c2) par application de la relation : $G_{\min} (k) = G_{\min} + Δ G_{\min} ({SNR}_{y} (k))$

k étant l'indice de la trame temporelle courante,
G_min (k) étant le gain minimal à appliquer à la trame temporelle courante,
G_min étant ladite valeur nominale paramétrée du gain minimal,
ΔG_min (k) étant ledit incrément/décrément apporté à G_min , et
SNR_y (k) étant le rapport signal sur bruit de la trame temporelle courante.

The scalar value of minimum gain can notably be calculated in step c2) by applying the relation:

{BOY WUT}_{\min} (k) = {BOY WUT}_{\min} + Δ {BOY WUT}_{\min} ({SNR}_{there} (k))

where k is the index of the current time frame,
G _min ( k ) being the minimum gain to be applied to the current time frame,
G _min being said parameterized nominal value of the minimum gain,
Δ G _min (k) being said increment / decrement brought to G _min , and
SNR _y (k) being the signal-to-noise ratio of the current time frame.

Dans une deuxième implémentation de l'invention, la variable globale est une probabilité moyenne de parole, évaluée au niveau de la trame temporelle courante.In a second implementation of the invention, the global variable is an average probability of speech, evaluated at the level of the current time frame.

La valeur scalaire de gain minimal peut notamment être calculée à l'étape c2) par application de la relation : $G_{\min} (k) = G_{\min} + (P_{speech} (k) - 1) . Δ_{1} G_{\min} + P_{speech} (k) . Δ_{2} G_{\min}$

k étant l'indice de la trame temporelle courante,
G_min (k) étant le gain minimal à appliquer à la trame temporelle courante,
G_min étant ladite valeur nominale paramétrée du gain minimal,
P_speech (k) étant la probabilité moyenne de parole évaluée au niveau de la trame temporelle courante,
Δ₁ G_min étant ledit incrément/décrément, apporté à G_min en phase de bruit, et
Δ₂ G_min étant ledit incrément/décrément, apporté à G_min en phase de parole.

{BOY WUT}_{\min} (k) = {BOY WUT}_{\min} + (P_{speech} (k) - 1) . Δ_{1} {BOY WUT}_{\min} + P_{speech} (k) . Δ_{2} {BOY WUT}_{\min}

where k is the index of the current time frame,
G _min ( k ) being the minimum gain to be applied to the current time frame,
G _min being said parameterized nominal value of the minimum gain,
P _speech ( k ) being the average probability of speech evaluated at the level of the current time frame,
Δ ₁ G _min being said increment / decrement, brought to G _min in the noise phase, and
Δ ₂ G _min being said increment / decrement, brought to G _min in speech phase.

La probabilité moyenne de parole peut notamment être évaluée au niveau de la trame temporelle courante par application de la relation : $P_{speech} (k) = \frac{1}{N} \sum_{l}^{N} p (k, l)$

l étant l'indice de la bande de fréquences,
N étant le nombre de bandes de fréquences dans le spectre, et
p(k,l) étant la probabilité de présence de parole de la bande de fréquences d'indice l de la trame temporelle courante.

The average probability of speech can notably be evaluated at the level of the current time frame by applying the relation:

P_{speech} (k) = \frac{1}{NOT} Σ_{l}^{NOT} p (k, l)

where l is the index of the frequency band,
N being the number of frequency bands in the spectrum, and
p ( k , l ) being the probability of presence of speech of the index frequency band I of the current time frame.

Dans une troisième implémentation de l'invention, la variable globale est un signal booléen de détection d'activité vocale pour la trame temporelle courante, évalué dans le domaine temporel par analyse de la trame temporelle et/ou au moyen d'un détecteur externe.In a third implementation of the invention, the global variable is a Boolean voice activity detection signal for the current time frame, evaluated in the time domain by analysis of the time frame and / or by means of an external detector.

La valeur scalaire de gain minimal peut notamment être calculée à l'étape c2) par application de la relation : $G_{\min} (k) = G_{\min} + VAD (k) . Δ G_{\min}$

k étant l'indice de la trame temporelle courante,
G_min (k) étant le gain minimal à appliquer à la trame temporelle courante,
G_min étant ladite valeur nominale paramétrée du gain minimal,
VAD (k) étant la valeur du signal booléen de détection d'activité vocale pour la trame temporelle courante, et
ΔG_min étant ledit incrément/décrément apporté à G_min .

{BOY WUT}_{\min} (k) = {BOY WUT}_{\min} + VAD (k) . Δ {BOY WUT}_{\min}

where k is the index of the current time frame,
G _min ( k ) being the minimum gain to be applied to the current time frame,
G _min being said parameterized nominal value of the minimum gain,
VAD (k) being the value of the Boolean voice activity detection signal for the current time frame, and
Δ G _min being said increment / decrement brought to G _min .

On va maintenant décrire un exemple de mise en oeuvre du dispositif de l'invention, en référence aux dessins annexés où les mêmes références numériques désignent d'une figure à l'autre des éléments identiques ou fonctionnellement semblables.

La Figure 1 illustre de façon schématique, sous forme de blocs fonctionnels, la manière dont est réalisé un traitement de débruitage de type OM-LSA selon l'état de la technique.
La Figure 2 illustre le perfectionnement apporté par l'invention à la technique de débruitage de la Figure 1.

An embodiment of the device of the invention will now be described with reference to the appended drawings in which the same reference numerals designate identical or functionally similar elements from one figure to another.

The Figure 1 schematically illustrates, in the form of functional blocks, the manner in which an OM-LSA type denoising treatment according to the state of the art is carried out.
The Figure 2 illustrates the improvement brought by the invention to the technique of denoising the Figure 1 .

Le processus de l'invention est mis en oeuvre par des moyens logiciels, schématisés sur les figures par un certain nombre de blocs fonctionnels correspondant à des algorithmes appropriés exécutés par un microcontrôleur ou un processeur numérique de signal. Bien que, pour la clarté de l'exposé, les différentes fonctions soient présentées sous forme de modules distincts, elles mettent en oeuvre des éléments communs et correspondent en pratique à une pluralité de fonctions globalement exécutées par un même logiciel.The process of the invention is implemented by software means, schematized in the figures by a number of functional blocks corresponding to appropriate algorithms executed by a microcontroller or a digital signal processor. Although, for the sake of clarity, the various functions are presented as separate modules, they implement common elements and correspond in practice to a plurality of functions globally executed by the same software.

OM-LSA denoising algorithm according to the state of the art

La Figure 1 illustre de façon schématique, sous forme de blocs fonctionnels, la manière dont est réalisé un traitement de débruitage de type OM-LSA selon l'état de la technique.The Figure 1 schematically illustrates, in the form of functional blocks, the manner in which an OM-LSA type denoising treatment according to the state of the art is carried out.

Le signal numérisé y(n) = x(n) + d(n) comprenant une composante de parole x(n) et une composante de bruit d(n) (n étant le rang de l'échantillon) est découpé (bloc 10) en segments ou trames temporelles y(k) (k étant l'indice de la trame) de longueur fixe, chevauchantes ou non, habituellement des trames de 256 échantillons pour un signal échantillonné à 8 kHz (standard téléphonique narrowband).The digitized signal y ( n ) = x (n) + d ( n ) comprising a speech component x ( n ) and a noise component d (n) ( n being the rank of the sample) is cut out (block 10 ) in segments or time frames y ( k ) (k being the frame index) of fixed length, overlapping or not, usually frames of 256 samples for a signal sampled at 8 kHz ( narrowband telephone standard).

Chaque trame temporelle d'indice k est ensuite transposée dans le domaine fréquentiel par une transformation rapide de Fourier FFT (bloc 12) : le signal résultant obtenu ou spectre Y(k,l), lui aussi discret, est alors décrit par un ensemble de bandes de fréquences ou "bins" fréquentiels (l étant l'indice de bin), par exemple 128 bins de fréquences positives. Un gain spectral G = G_OMLSA (k,l), propre à chaque bin, est appliqué (bloc 14) au signal fréquentiel Y(k,l), pour donner un signal X̂ (k, l) : $\hat{X} (k, l) = G_{OMLSA} (k, l) . Y (k, l)$

Each time frame of index k is then transposed into the frequency domain by a fast Fourier transform FFT (block 12): the resulting obtained signal or spectrum Y ( k , l ), also discrete, is then described by a set of Frequency bands or "bins" (where l is the bin index), for example 128 bins of positive frequencies. A spectral gain G = G _OMLSA ( k, l), specific to each bin, is applied (block 14) to the frequency signal Y ( k , l ), to give a signal X ( k, l ):

\hat{X} (k, l) = {BOY WUT}_{OMLSA} (k, l) . Y (k, l)

Le gain spectral G_OMLSA(k,l) est calculé (bloc 16) en fonction d'une part d'une probabilité de présence de parole p(k,l), qui est une probabilité fréquentielle évaluée (bloc 18) pour chaque bin, et d'autre part d'un paramètre G_min , qui est une valeur scalaire de gain minimal, dénommée couramment "dureté de débruitage". Ce paramètre G_min fixe une borne inférieure au gain d'atténuation appliqué sur les zones identifiées comme du bruit, afin d'éviter que les phénomènes de bruit musical et de voix robotisée ne deviennent trop marqués du fait de l'application de gains spectraux d'atténuation trop importants et/ou hétérogènes.The spectral gain G _OMLSA (k, l) is calculated (block 16) as a function of a part of a probability of presence of speech p ( k , l ), which is an estimated frequency probability (block 18) for each bin and on the other hand a parameter G _min , which is a scalar value of minimum gain, commonly referred to as "denoising hardness". This parameter G _min sets a lower limit to the attenuation gain applied to the zones identified as noise, in order to avoid that the phenomena of musical noise and robotic voice become too marked due to the application of spectral gains of too much and / or heterogeneous attenuation.

Le gain spectral G_OMLSA(k,l) calculé est de la forme : $G_{OMLSA} (k, l) = {\{G (k, l)\}}^{p (k, l)} . G_{\min}^{1 - p (k, l)}$

The spectral gain G _OMLSA (k, l) calculated is of the form:

{BOY WUT}_{OMLSA} (k, l) = {\{BOY WUT (k, l)\}}^{p (k, l)} . {BOY WUT}_{\min}^{1 - p (k, l)}

Le calcul du gain spectral et celui de la probabilité de présence de parole sont donc avantageusement implémentés sous forme d'un algorithme de type OM-LSA (Optimally Modified - Log Spectral Amplitude) tel que celui décrit dans l'article (précité) :

[2] I. Cohen, "Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator", IEEE Signal Processing Letters, Vol. 9, No 4, pp. 113-116, Apr. 2002 .

The calculation of the spectral gain and that of the probability of presence of speech are therefore advantageously implemented in the form of an OM-LSA (Optimally Modified Log Spectral Amplitude) type algorithm such as that described in the article (cited above):

[2] I. Cohen, "Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator," IEEE Signal Processing Letters, Vol. 9, No. 4, pp. 113-116, Apr. 2002 .

Essentiellement, l'application d'un gain nommé "gain LSA" (Log-Spectral Amplitude) permet de minimiser la distance quadratique moyenne entre le logarithme de l'amplitude du signal estimé et le logarithme de l'amplitude du signal de parole originel. Ce critère se montre adapté, car la distance choisie est en meilleure adéquation avec le comportement de l'oreille humaine et donne donc qualitativement de meilleurs résultats.Essentially, the application of a gain called Log-Spectral Amplitude (LSA ) is used to minimize the mean squared distance between the logarithm of the amplitude of the estimated signal and the logarithm of the amplitude of the original speech signal. This criterion is adapted because the distance chosen is in better adequacy with the behavior of the human ear and thus gives qualitatively better results.

Dans tous les cas, il s'agit de diminuer l'énergie des composantes fréquentielles très parasitées en leur appliquant un gain faible, tout en laissant intactes (par l'application d'un gain égal à 1) celles qui le sont peu ou pas du tout.In all cases, it is a question of reducing the energy of the highly parasitized frequency components by applying them a weak gain, while leaving intact (by the application of a gain equal to 1) those which are it little or not at all.

L'algorithme "OM-LSA" (Optimally-Modified LSA) améliore le calcul du gain LSA en le pondérant par la probabilité conditionnelle p(k,l) de présence de parole ou SPP (Speech Presence Probability), pour le calcul du gain final : la réduction de bruit appliquée est d'autant plus importante (c'est-à-dire que le gain appliqué est d'autant plus faible) que la probabilité de présence de parole est faible.The algorithm "OM-LSA" ( Optimally-Modified LSA ) improves the computation of the LSA gain by weighting it by the conditional probability p ( k , l ) of presence of speech or SPP ( Speech Presence Probability ), for the computation of the gain final: the noise reduction applied is all the more important (that is to say that the applied gain is even lower) that the probability of presence of speech is low.

La probabilité de présence de parole p(k,l) est un paramètre pouvant prendre plusieurs valeurs différentes comprises entre 0 et 100 %. Ce paramètre est calculé selon une technique en elle-même connue, dont des exemples sont notamment exposés dans :

[3] I. Cohen et B. Berdugo, "Two-Channel Signal Detection and Speech Enhancement Based on the Transient Beam-to-Reference Ratio", IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2003, Hong-Kong, pp. 233-236, Apr. 2003 .

The probability of presence of speech p ( k , l ) is a parameter that can take several different values between 0 and 100%. This parameter is calculated according to a technique in itself known, examples of which are in particular exposed in:

[3] I. Cohen and B. Berdugo, "Two-Channel Signal Detection and Speech Enhancement Based on the Transient Beam-to-Reference Ratio," IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2003, Hong Kong, pp. 233-236, Apr. 2003 .

Comme fréquemment dans ce domaine, le procédé décrit n'a pas pour objectif d'identifier précisément sur quelles composantes fréquentielles de quelles trames la parole est absente, mais plutôt de donner un indice de confiance entre 0 et 1, une valeur 1 indiquant que la parole est absente à coup sûr (selon l'algorithme) tandis qu'une valeur 0 déclare le contraire. Par sa nature, cet indice est assimilé à la probabilité d'absence de la parole a priori, c'est-à-dire la probabilité que la parole soit absente sur une composante fréquentielle donnée de la trame considérée. Il s'agit bien sûr d'une assimilation non rigoureuse, dans le sens que même si la présence de la parole est probabiliste ex ante, le signal capté par le micro ne présente à chaque instant que l'un de deux états distincts : à l'instant considéré, il peut soit comporter de la parole soit ne pas en contenir. En pratique, cette assimilation donne toutefois de bons résultats, ce qui justifie son utilisation.As is frequently the case in this field, the method described is not intended to identify precisely on which frequency components of which frames the speech is absent, but rather to give a confidence index between 0 and 1, a value 1 indicating that the speech is absent for sure (according to the algorithm) while a value 0 declares the opposite. By its nature, this index is likened to the probability of absence of speech a priori, that is to say the probability that speech is absent on a given frequency component of the frame considered. This is of course a non-rigorous assimilation, in the sense that even if the presence of speech is probabilistic ex ante, the signal picked up by the microphone presents at each moment only one of two distinct states: at the moment considered, it can either include speech or not contain it. In practice, however, this assimilation gives good results, which justifies its use.

On pourra également se référer au WO 2007/099222 A1 (Parrot ), qui décrit en détail une technique de débruitage dérivée de ce principe, mettant en oeuvre un calcul de probabilité de présence de parole.We can also refer to WO 2007/099222 A1 (Parrot ), which describes in detail a denoising technique derived from this principle, implementing a probability calculation of presence of speech.

Le signal résultant X̂ (k,l) = G_OMLSA (k,l). Y(k,l), c'est-à-dire le signal utile Y(k,l) auquel a été appliqué le masque fréquentiel G_OMLSA (k,l), fait ensuite l'objet d'une transformation de Fourier inverse iFFT (bloc 20), pour repasser du domaine fréquentiel au domaine temporel. Les trames temporelles obtenues sont ensuite rassemblées (bloc 22) pour donner un signal débruité numérisé x̂(n).The resulting signal X ( k , l ) = G _OMLSA ( k, l ). Y ( k , l ), that is to say the useful signal Y ( k , l ) to which the frequency mask G _OMLSA ( k , l ) has been applied, is then subjected to an inverse Fourier transformation. iFFT (block 20), to go back from the frequency domain to the time domain. The resulting time frames are then collected (block 22) to give a digitized denoised signal x ( n ).

OM-LSA denoising algorithm according to the invention

La Figure 2 illustre les modifications apportées à l'algorithme que l'on vient d'exposer. Les blocs portant les mêmes références numériques correspondent à des fonctions identiques ou similaires à celles exposées plus haut, de même que les références des divers signaux traités.The Figure 2 illustrates the changes made to the algorithm just described. Blocks bearing the same numerical references correspond to functions that are identical or similar to those described above, as well as the references of the various processed signals.

Dans l'implémentation connue de la Figure 1, la valeur scalaire G_min du gain minimal représentatif de la dureté de débruitage était choisie plus ou moins empiriquement, de telle sorte que la dégradation de la voix reste peu audible, tout en assurant une atténuation acceptable du bruit. Comme on l'a exposé en introduction, il est cependant souhaitable de débruiter plus agressivement en phase de bruit seul, mais sans pour autant introduire de bruit musical ; inversement, sur une période de parole bruitée, on peut s'autoriser à moins débruiter afin de parfaitement préserver la voix tout en veillant à ce que la variation d'énergie du bruit de fond résiduel ne soit pas perceptible.In the known implementation of the Figure 1 , the scalar value G _min of the minimal gain representative of the denoising hardness was chosen more or less empirically, so that the degradation of the voice remains low audible, while ensuring an acceptable attenuation of the noise. As stated in the introduction, however, it is desirable to denoise more aggressively in the noise phase alone, but without introducing musical noise; conversely, over a period of noisy speech, we can allow less denoising to perfectly preserve the voice while ensuring that the energy variation of the residual background noise is not noticeable.

On peut disposer selon le cas (phase de bruit seul ou bien phase de parole) d'un double intérêt à moduler la dureté du débruitage : celle-ci sera modulée en faisant varier dynamiquement la valeur scalaire de G_min , dans le sens adéquat qui réduira le bruit en phase de bruit seul et préservera mieux la voix en phase de parole.Depending on the case (noise phase alone or speech phase), there may be a double interest in modulating the hardness of denoising: this will be modulated by dynamically varying the scalar value of G _min , in the appropriate direction that will reduce noise in the noise phase alone and will better preserve voice in the speech phase.

Pour ce faire, la valeur scalaire G_min, initialement constante, est transformée (bloc 24) en une fonction temporelle G_min (k) dont la valeur sera déterminée en fonction d'une variable globale (également désignée "descripteur temporel"), c'est-à-dire d'une variable considérée globalement au niveau de la trame et non pas du bin fréquentiel. Cette variable globale peut être reflétée par l'état d'un ou plusieurs estimateurs différents déjà calculés par l'algorithme, qui seront choisis selon le cas en fonction de leur pertinence.To do this, the scalar value G _min , initially constant, is transformed (block 24) into a time function G _min ( k ) whose value will be determined according to a global variable (also called "temporal descriptor"). that is, a variable considered globally at the level of the frame and not the frequency bin. This global variable can be reflected by the state of one or more different estimators already calculated by the algorithm, which will be chosen according to the case according to their relevance.

Ces estimateurs peuvent notamment être : i) un rapport signal sur bruit, ii) une probabilité moyenne de présence de parole et/ou iii) une détection d'activité vocale. Dans tous ces exemples, la dureté de débruitage G_min devient une fonction temporelle G_min (k) définie par les estimateurs, eux-mêmes temporels, permettant de décrire des situations connues pour lesquelles on souhaite moduler la valeur de G_min afin d'influer sur la réduction de bruit en modifiant de façon dynamique le compromis débruitage/dégradation du signal.These estimators can be: i) a signal-to-noise ratio, ii) an average probability of presence of speech and / or iii) a voice activity detection. In all these examples, the denoising hardness G _min becomes a time function G _min ( k ) defined by the estimators, themselves temporal, making it possible to describe known situations for which it is desired to modulate the value of G _{min in} order to influence on noise reduction by dynamically modifying the signal denoise / degradation compromise.

On notera incidemment que, pour que cette modulation dynamique de la dureté ne soit pas perceptible par l'auditeur, il convient de prévoir un mécanisme pour prévenir des variations brutales de G_min (k), par exemple par une technique conventionnelle de lissage temporel. On évitera ainsi que des variations temporelles brusques de la dureté G_min (k) ne soient audibles sur le bruit résiduel, qui est très souvent stationnaire dans le cas par exemple d'un automobiliste en condition de roulage.Incidentally, in order for this dynamic modulation of the hardness not to be perceptible by the listener, it is necessary to provide a mechanism for preventing abrupt variations of G _min ( k ), for example by a conventional temporal smoothing technique. It will thus be avoided that abrupt temporal variations in the hardness G _min ( k ) are audible on the residual noise, which is very often stationary in the case for example of a motorist in rolling condition.

Descripteur temporel : rapport signal sur bruitTime descriptor: signal-to-noise ratio

Le point de départ de cette première implémentation est la constatation de ce qu'un signal de parole capté dans un environnement silencieux n'a que peu, voire pas, besoin d'être débruité, et qu'un débruitage énergique appliqué à un tel signal conduirait rapidement à des artefacts audibles, sans que le confort d'écoute ne soit amélioré du seul point de vue du bruit résiduel.The starting point of this first implementation is the observation that a speech signal picked up in a quiet environment has little or no need to be de-noised, and that an energetic denoising applied to such a signal would lead quickly to audible artifacts, without the comfort of listening improved from the point of view of residual noise.

À l'inverse, un signal excessivement bruité peu rapidement devenir inintelligible ou susciter une fatigue progressive à l'écoute ; dans un tel cas le bénéfice d'un débruitage important sera indiscutable, même au prix d'une dégradation audible (toutefois raisonnable et contrôlée) de la parole.Conversely, an excessively noisy signal can quickly become unintelligible or cause progressive listening fatigue; in such a case the benefit of a large denoising will be indisputable, even at the cost of an audible (however reasonable and controlled) degradation of speech.

En d'autres termes, la réduction de bruit sera d'autant plus bénéfique pour la compréhension du signal utile que le signal non traité est bruité.In other words, the noise reduction will be all the more beneficial for the understanding of the useful signal that the untreated signal is noisy.

Ceci peut être pris en compte en modulant le paramètre de dureté G_min en fonction du rapport signal sur bruit a priori ou du niveau de bruit courant du signal traité : $G_{\min} (k) = G_{\min} + Δ G_{\min} ({SNR}_{y} (k))$

G_min (k) étant le gain minimal à appliquer à la trame temporelle courante,
G_min étant une valeur nominale paramétrée de ce gain minimal,
ΔG_min (k) étant l'incrément/décrément apporté à la valeur G_min , et
SNR_y (k) étant le rapport signal sur bruit de la trame courante, évalué dans le domaine temporel (bloc 26), correspondant à la variable appliquée sur l'entrée n° ① du bloc 24 (ces "entrées" étant symboliques et n'ayant qu'une valeur illustrative des différentes possibilités alternatives de mise en oeuvre de l'invention).

This can be taken into account by modulating the hardness parameter G _min as a function of the signal-to-noise ratio a priori or the noise level of the processed signal:

{BOY WUT}_{\min} (k) = {BOY WUT}_{\min} + Δ {BOY WUT}_{\min} ({SNR}_{there} (k))

G _min ( k ) being the minimum gain to be applied to the current time frame,
G _min being a parameterized nominal value of this minimum gain,
Δ G _min (k) being the increment / decrement brought to the value G _min , and
SNR _y ( k ) being the signal-to-noise ratio of the current frame, evaluated in the time domain (block 26), corresponding to the variable applied on the input n ° ① of the block 24 (these "entries" being symbolic and n having an illustrative value of the different alternative possibilities of implementing the invention).

Descripteur temporel : probabilité moyenne de présence de paroleTemporal descriptor: mean probability of speech

Un autre critère pertinent pour moduler la dureté de la réduction peut être la présence de parole pour la trame temporelle considérée.Another relevant criterion for modulating the hardness of the reduction may be the presence of speech for the time frame considered.

Avec l'algorithme conventionnel, lorsqu'on tente d'augmenter la dureté de débruitage G_min , le phénomène de "voix robotisée" apparait avant celui de "bruit musical". Ainsi, il parait possible et intéressant d'appliquer une dureté de débruitage plus grande dans une phase de bruit seul, en modulant simplement le paramètre de dureté de débruitage par un indicateur global de présence de parole : en période de bruit seul, le bruit résiduel - à l'origine de la fatigue d'écoute - sera réduit par application d'une dureté plus importante, et ce sans contrepartie puisque la dureté en phase de parole peut rester inchangée.With the conventional algorithm, when one tries to increase the denoising hardness G _min , the phenomenon of "robotic voice" appears before that of "musical noise". Thus, it seems possible and interesting to apply a greater denoising hardness in a noise phase alone, by simply modulating the denoising hardness parameter by a global indicator of presence of speech: in noise-only period, the residual noise - at the origin of the listening fatigue - will be reduced by application of a greater hardness, and this without counterpart since the hardness in speech phase can remain unchanged.

Comme l'algorithme de réduction de bruit repose sur un calcul de probabilité de présence de parole fréquentielle, il est aisé d'obtenir un indice moyen de présence de parole à l'échelle de la trame à partir des différentes probabilités fréquentielles, de manière à différencier les trames principalement constituées de bruit de celles qui contiennent de la parole utile. On peut par exemple utiliser l'estimateur classique : $P_{speech} (k) = \frac{1}{N} \sum_{l}^{N} p (k, l)$

P_speech (k) étant la probabilité moyenne de parole évaluée au niveau de la trame temporelle courante,
N étant le nombre de bins du spectre, et
p(k,l) étant la probabilité de présence de parole du bin d'indice l de la trame temporelle courante.

Since the noise reduction algorithm is based on a frequency presence probability calculation, it is easy to obtain a mean index of the presence of speech at the frame scale from the different frequency probabilities, so as to differentiate the frames mainly made of noise from those containing useful speech. For example, we can use the classical estimator:

P_{speech} (k) = \frac{1}{NOT} Σ_{l}^{NOT} p (k, l)

P _speech ( k ) being the average probability of speech evaluated at the level of the current time frame,
N being the number of bins in the spectrum, and
p ( k , l ) being the probability of presence of speech of the bin of index l of the current time frame.

Cette variable P_speech (k) est calculée par le bloc 28 et appliquée sur l'entrée n° ② du bloc 24, qui calcule la dureté de débruitage à appliquer pour une trame donnée : $G_{\min} (k) = G_{\min} + (P_{speech} (k) - 1) . Δ_{1} G_{\min} + P_{speech} (k) . Δ_{2} G_{\min}$

G_min (k) étant le gain minimal à appliquer à la trame temporelle courante,
G_min étant une valeur nominale paramétrée de ce gain minimal, et
Δ₁ G_min étant un incrément/décrément apporté à G_min en phase de bruit, et
Δ₂ G_min étant un incrément/décrément apporté à G_min en phase de parole.

This variable P _speech ( k ) is calculated by the block 28 and applied to the input No. ② of the block 24, which calculates the denoising hardness to be applied for a given frame:

{BOY WUT}_{\min} (k) = {BOY WUT}_{\min} + (P_{speech} (k) - 1) . Δ_{1} {BOY WUT}_{\min} + P_{speech} (k) . Δ_{2} {BOY WUT}_{\min}

G _min ( k ) being the minimum gain to be applied to the current time frame,
G _min being a parameterized nominal value of this minimum gain, and
Δ ₁ G _min being an increment / decrement brought to G _min in the noise phase, and
Δ ₂ G _min being an increment / decrement brought to G _min in speech phase.

L'expression ci-dessus met bien en évidence les deux effets complémentaires de l'optimisation présentée, à savoir :

l'augmentation de la dureté de la réduction de bruit d'un facteur Δ₁ G_min en phase de bruit afin de réduire le bruit résiduel, typiquement Δ₁ > 0, par exemple Δ₁ = +6 dB ; et
la diminution de la dureté de la réduction de bruit d'un facteur Δ₂ G_min en phase de parole afin de mieux préserver la voix, typiquement Δ₂ < 0, par exemple Δ₂ = -3 dB.

The above expression clearly highlights the two complementary effects of the presented optimization, namely:

increasing the hardness of the noise reduction by a factor Δ ₁ G _min in the noise phase in order to reduce the residual noise, typically Δ ₁ > 0, for example Δ ₁ = + 6 dB; and
reducing the hardness of the noise reduction by a factor Δ ₂ G _min in the speech phase in order to better preserve the voice, typically Δ ₂ <0, for example Δ ₂ = -3 dB.

Descripteur temporel : détecteur d'activité vocaleTime descriptor: voice activity detector

Dans cette troisième implémentation, un détecteur d'activité vocale ou VAD (bloc 30) est mis à profit pour effectuer le même type de modulation de dureté que dans l'exemple précédent. Un tel détecteur "parfait" délivre un signal binaire (absence vs. présence de parole), et se distingue des systèmes délivrant seulement une probabilité de présence de parole variable entre 0 et 100 % de façon continue ou par pas successifs, qui peuvent introduire des fausses détections importantes dans des environnements bruités.In this third implementation, a voice activity detector or VAD (block 30) is used to perform the same type of hardness modulation as in the previous example. Such a "perfect" detector delivers a binary signal (absence vs. presence of speech), and is distinguished from systems delivering only a probability of presence of variable speech between 0 and 100% continuously or in successive steps, which can introduce false important detections in noisy environments.

Le module de détection d'activité vocale ne prenant que deux valeurs distinctes '0' ou '1', la modulation de la dureté de débruitage sera discrète : $G_{\min} (k) = G_{\min} + VAD (k) . Δ G_{\min}$

G_min (k) étant le gain minimal à appliquer à la trame temporelle courante,
G_min étant une valeur nominale paramétrée dudit gain minimal,
VAD (k) étant la valeur du signal booléen de détection d'activité vocale pour la trame temporelle courante, évalué dans le domaine temporel (bloc 30) et appliqué à l'entrée n° ③ du bloc 24, et
ΔG_min étant l'incrément/décrément apporté à la valeur G_min.

Since the voice activity detection module takes only two distinct values '0' or '1', the modulation of the denoising hardness will be discrete:

{BOY WUT}_{\min} (k) = {BOY WUT}_{\min} + VAD (k) . Δ {BOY WUT}_{\min}

G _min ( k ) being the minimum gain to be applied to the current time frame,
G _min being a parameterized nominal value of said minimum gain,
VAD ( k ) being the value of the Boolean voice activity detection signal for the current time frame, evaluated in the time domain (block 30) and applied to the input No. ③ of the block 24, and
Δ G _min being the increment / decrement brought to the value G _min .

Le détecteur d'activité vocale 30 peut être réalisé de différentes manières, dont a va donner ci-dessous trois exemples d'implémentation.The voice activity detector 30 may be implemented in different ways, of which three examples of implementation will be given below.

Dans un premier exemple, la détection est opérée à partir du signal y(k), d'une manière intrinsèque au signal recueilli par le micro ; une analyse du caractère plus ou moins harmonique de ce signal permet de déterminer la présence d'une activité vocale, car un signal présentant une forte harmonicité peut être considéré, avec une faible marge d'erreur, comme étant un signal de voix, donc correspondant à une présence de parole.In a first example, the detection is performed from the signal y ( k ), intrinsically to the signal collected by the microphone; an analysis of the more or less harmonic nature of this signal makes it possible to determine the presence of a vocal activity, because a signal having a strong harmonicity can be considered, with a small margin of error, as being a voice signal, therefore corresponding to a presence of speech.

Dans un deuxième exemple, le détecteur d'activité vocale 30 fonctionne en réponse au signal produit par une caméra, installée par exemple dans l'habitacle d'un véhicule automobile et orientée de manière que son angle de champ englobe en toutes circonstances la tête du conducteur, considéré comme le locuteur proche. Le signal délivré par la caméra est analysé pour déterminer d'après le mouvement de la bouche et des lèvres si le locuteur parle ou non, comme cela est décrit entre autres dans le EP 2 530 672 A1 (Parrot SA) , auquel on pourra se référer pour de plus amples explications. L'avantage de cette technique d'analyse d'image est de disposer d'une information complémentaire totalement indépendante de l'environnement de bruit acoustique.In a second example, the voice activity detector 30 operates in response to the signal produced by a camera, installed for example in the passenger compartment of a motor vehicle and oriented so that its angle of view encompasses in all circumstances the head of the vehicle. driver, considered to be the close speaker. The signal delivered by the camera is analyzed to determine from the movement of the mouth and lips whether the speaker speaks or not, as described inter alia in the EP 2 530 672 A1 (Parrot SA) , which can be referred to for further explanation. The advantage of this image analysis technique is to have complementary information completely independent of the acoustic noise environment.

Un troisième exemple de capteur utilisable pour la détection d'activité vocale est un capteur physiologique susceptible de détecter certaines vibrations vocales du locuteur qui ne sont pas ou peu corrompues par le bruit environnant. Un tel capteur peut être notamment constitué d'un accéléromètre ou d'un capteur piézoélectrique appliqué contre la joue ou la tempe du locuteur. Il peut être en particulier incorporé au coussinet d'un écouteur d'un ensemble combiné micro/casque, comme cela est décrit dans le EP 2 518 724 A1 (Parrot SA), auquel on pourra se reporter pour plus de détails.A third example of a sensor that can be used for voice activity detection is a physiological sensor capable of detecting certain vocal vibrations of the speaker that are not or only slightly corrupted by the surrounding noise. Such a sensor may especially consist of an accelerometer or a piezoelectric sensor applied against the cheek or the temple of the speaker. It can be in particular incorporated in the earpad pad of a combined microphone / headset assembly, as described in the EP 2 518 724 A1 (Parrot SA), which can be referred to for more details.

En effet, lorsqu'une personne émet un son voisé (c'est-à-dire une composante de parole dont la production s'accompagne d'une vibration des cordes vocales), une vibration se propage depuis les cordes vocales jusqu'au pharynx et à la cavité bucco-nasale, où elle est modulée, amplifiée et articulée. La bouche, le voile du palais, le pharynx, les sinus et les fosses nasales servent ensuite de caisse de résonance à ce son voisé et, leur paroi étant élastique, elles vibrent à leur tour et ces vibrations sont transmises par conduction osseuse interne et sont perceptibles au niveau de la joue et de la tempe.Indeed, when a person makes a voiced sound (that is, a speech component whose production is accompanied by a vibration of the vocal cords), a vibration propagates from the vocal cords to the pharynx and to the bucco-nasal cavity, where it is modulated, amplified and articulated. The mouth, the soft palate, the pharynx, the sinuses and the nasal fossae then serve as a sounding board for this voiced sound and, their wall being elastic, they vibrate in turn and these vibrations are transmitted by internal bone conduction and are perceptible at the cheek and temple.

Ces vibrations au niveau de la joue et de la tempe présentent la caractéristique d'être, par nature, très peu corrompues par le bruit environnant. En effet, en présence de bruits extérieurs, même importants, les tissus de la joue et de la tempe ne vibrent quasiment pas, et ceci quelle que soit la composition spectrale du bruit extérieur. Un capteur physiologique qui recueille ces vibrations vocales dépourvues de bruit donne un signal représentatif de la présence ou de l'absence de sons voisés émis par le locuteur, permettant donc de discriminer très bien les phases de parole et les phases de silence du locuteur.These vibrations in the cheek and temple have the characteristic of being, by nature, very little corrupted by the surrounding noise. Indeed, in the presence of external noise, even important, the tissues of the cheek and the temple do not vibrate almost, and this regardless of the spectral composition of the outside noise. A physiological sensor which collects these noise-free vocal vibrations gives a signal representative of the presence or absence of voiced sounds emitted by the speaker, thus making it possible to discriminate very clearly the speech phases and the speaker's silence phases.

Implementation variant of the OM-LSA denoising algorithm

En variante ou en complément de ce qui précède, le gain spectral G_OMLSA - calculé dans le domaine fréquentiel pour chaque bin - peut être modulé de façon indirecte, en pondérant la probabilité de présence de parole fréquentielle p(k,l) par un indicateur global temporel observé au niveau de la trame (et non plus d'un simple bin fréquentiel particulier).As a variant or in addition to the above, the spectral gain G _OMLSA - calculated in the frequency domain for each bin - can be modulated indirectly, by weighting the probability of presence of frequency speech p ( k , l ) by an indicator global time observed at the level of the frame (and no longer a single particular frequency bin).

Dans ce cas, si une trame de bruit seul est détectée, on peut avantageusement considérer que chaque probabilité fréquentielle de parole devrait être nulle, et la probabilité fréquentielle locale pourra être pondérée par une donnée globale, cette donnée globale permettant de faire une déduction sur le cas réel rencontré à l'échelle de la trame (phase de parole/phase de bruit seul), que la seule donnée dans le domaine fréquentiel n'autorise pas à formuler ; en présence de bruit seul, on pourra se ramener à un débruitage uniforme, évitant toute musicalité du bruit, qui gardera son "grain" d'origine.In this case, if a single noise frame is detected, it can advantageously be considered that each frequency probability of speech should be zero, and the local frequency probability can be weighted by a global datum, this global datum making it possible to make a deduction on the real case encountered at the frame scale (speech / phase phase noise alone), that the only data in the frequency domain does not allow to formulate; in the presence of noise alone, we can reduce ourselves to a uniform denoising, avoiding any musicality of the noise, which will keep its "grain" of origin.

En d'autres termes, la probabilité de présence de parole initialement fréquentielle sera pondérée par une probabilité de présence globale de parole à l'échelle de la trame : on s'efforcera alors de débruiter de manière homogène l'ensemble de la trame dans un cas d'absence de parole (débruiter uniformément quand la parole est absente).In other words, the probability of presence of speech initially frequency will be weighted by a probability of global presence of speech at the scale of the frame: one will then strive to denoise in a homogeneous way the whole of the frame in a absence of speech (denoise uniformly when speech is absent).

En effet, comme on l'a exposé plus haut, de présence de parole P_speech (k) (calculée comme la moyenne arithmétique des probabilités fréquentielles de présence de parole) est un indicateur plutôt fiable de la présence de parole à l'échelle de la trame. On peut alors envisager de modifier l'expression conventionnelle du calcul du gain OM-LSA, à savoir : $G_{OMLSA} (k, l) = {\{G (k, l)\}}^{p (k, l)} . G_{\min}^{1 - p (k, l)}$

en pondérant la probabilité fréquentielle de présence de parole par une donnée globale p_glob (k) de présence de parole évaluée au niveau de la trame :

G_{OMLSA} (k, l) = {\{G (k, l)\}}^{p (k, l) . p_{glob} (k)} . G_{\min}^{1 - p (k, l) . p_{glob} (k)}

G_Oivf_LSA (k,l) étant le gain spectral à appliquer au bin d'indice l de la trame temporelle courante,
G (k,l) étant un gain de débruitage sous-optimal à appliquer au bin d'indice l,
p(k,l) étant la probabilité de présence de parole du bin d'indice l de la trame temporelle courante,
p_glob (k) étant la probabilité globale et seuillée de parole, évaluée au niveau de la trame temporelle courante, et
G_min étant une valeur nominale paramétrée du gain spectral.

Indeed, as explained above, the presence of speech P _speech ( k ) (calculated as the arithmetic mean of the frequency probabilities of presence of speech) is a rather reliable indicator of the presence of speech on the scale of speech. the frame. One can then consider modifying the conventional expression of the OM-LSA gain calculation, namely:

{BOY WUT}_{OMLSA} (k, l) = {\{BOY WUT (k, l)\}}^{p (k, l)} . {BOY WUT}_{\min}^{1 - p (k, l)}

by weighting the frequency probability of presence of speech by a global data item p _glob ( k ) of speech presence evaluated at the level of the frame:

{BOY WUT}_{OMLSA} (k, l) = {\{BOY WUT (k, l)\}}^{p (k, l) . p_{glob} (k)} . {BOY WUT}_{\min}^{1 - p (k, l) . p_{glob} (k)}

G _Oi vf _LSA ( k , l ) being the spectral gain to be applied to the bin of index l of the current time frame,
G (k, l) being a suboptimal denoising gain to be applied to bin of index l,
p ( k , l ) being the probability of presence of speech of the bin of index l of the current time frame,
p _glob ( k ) being the global and thresholded probability of speech, evaluated at the level of the current time frame, and
G _min being a parameterized nominal value of the spectral gain.

La donnée globale p_glob (k) au niveau de la trame temporelle peut notamment être évaluée de la manière suivante : $p_{glob} (k) = \frac{1}{P_{seuil}} . \max \{P_{speech} (k); P_{seuil}\}$

P_{speech} (k) = \frac{1}{N} \sum_{l}^{N} p (k, l)

P_seuil étant une valeur de seuil de la probabilité globale de parole, et
N étant le nombre de bins dans le spectre.

The global datum p _glob ( k ) at the time frame can be evaluated in particular as follows:

p_{glob} (k) = \frac{1}{P_{threshold}} . \max \{P_{speech} (k); P_{threshold}\}

P_{speech} (k) = \frac{1}{NOT} Σ_{l}^{NOT} p (k, l)

P _threshold being a _threshold value of the overall probability of speech, and
N being the number of bins in the spectrum.

Ceci revient à substituer dans l'expression conventionnelle la probabilité fréquentielle p(k,l) par une probabilité combinée p_combinée (k,l) qui intègre une pondération par la donnée globale p _glob (k), non fréquentielle, évaluée au niveau de la trame temporelle en présence de parole : $G_{OMLSA} (k, l) = {\{G (k, l)\}}^{p_{combinée} (k, l)} . G_{\min}^{1 - p_{combinée} (k, l)}$

p_{combinée} (k, l) = p (k, l) . p_{glob} (k)

This amounts to replacing in the conventional expression the frequency probability p ( k , l ) by a combined _combined probability p ( k , l ) which integrates a weighting by the global data p _{glo b} (k ), non-frequency, evaluated at the level of the time frame in the presence of speech:

{BOY WUT}_{OMLSA} (k, l) = {\{BOY WUT (k, l)\}}^{p_{combined} (k, l)} . {BOY WUT}_{\min}^{1 - p_{combined} (k, l)}

p_{combined} (k, l) = p (k, l) . p_{glob} (k)

En d'autres termes :

en présence de parole au niveau de la trame, c'est-à-dire si P_speech (k) > P_seuil , l'expression conventionnelle du calcul du gain OM-LSA reste inchangée ;
en l'absence de parole au niveau de la trame, c'est-à-dire si P_speech (k) < P _seui/, les probabilités fréquentielles p(k,l) seront en revanche pondérées par la probabilité globale p_glob (k) faible, ce qui aura pour impact d'uniformiser les probabilités en diminuant leurs valeurs ;
dans le cas asymptotique particulier P_speech (k) = 0, toutes les probabilités seront nulles et le débruitage sera totalement uniforme.

In other words:

in the presence of speech at the level of the frame, that is to say if P _speech ( k )> P _threshold , the conventional expression of the calculation of the gain OM-LSA remains unchanged;
in the absence of speech at the level of the frame, that is to say if P _speech ( k ) < P _{threshold /} , the frequency probabilities p ( k , l ) will instead be weighted by the global probability p _glob ( k ) low, which will have the effect of standardizing probabilities by decreasing their values;
in the particular asymptotic case P _speech ( k ) = 0, all the probabilities will be zero and denoising will be totally uniform.

L'évaluation de la donnée globale p_glob (k) est schématisée sur la Figure 2 par le bloc 32, qui reçoit en entrée les données P_seuil (valeur de seuil paramétrable) et P_speech (k,l) (valeur elle-même calculée par le bloc 28, comme décrit plus haut), et délivre en sortie la valeur p_glob (k) qui est appliquée à l'entrée ④ du bloc 24.The evaluation of the global data p _glob ( k ) is schematized on the Figure 2 by the block 32, which receives as input the data P _threshold (parameterizable threshold value) and P _speech ( k , l ) (value itself calculated by the block 28, as described above), and outputs the value p _glob ( k ) which is applied to the input ④ of the block 24.

Ici encore, on utilise une donnée globale calculée au niveau de la trame pour affiner le calcul du gain fréquentiel de débruitage, et ceci en fonction du cas de figure rencontré (absence/présence de parole). En particulier, la donnée globale permet d'estimer la situation réelle rencontrée à l'échelle de la trame (phase de parole vs. phase de bruit seul), ce que la seule donnée fréquentielle ne permettrait pas de formuler. Et en présence de bruit seul, on peut se ramener à un débruitage uniforme, solution idéale car le bruit résiduel perçu ne sera alors jamais musical.Here again, a global data item calculated at the level of the frame is used to refine the calculation of the frequency gain of denoising, and this as a function of the case encountered (absence / presence of speech). In particular, the global data makes it possible to estimate the actual situation encountered at the of the frame (speech phase vs. noise phase alone), which the only frequency data would not allow to formulate. And in the presence of noise alone, we can reduce to a uniform denoising, ideal solution because the perceived residual noise will never be musical.

Results obtained by the algorithm of the invention

Comme on vient de l'exposer, l'invention repose sur la mise en évidence de ce que le compromis débruitage/dégradation du signal repose sur un calcul de gain spectral (fonction d'un paramètre scalaire de gain minimal et d'une probabilité de présence de parole) dont le modèle est sous-optimal, et propose une formule impliquant une modulation temporelle de ces éléments de calcul du gain spectral, qui deviennent fonction de descripteurs temporels pertinents du signal de parole bruitée.As has just been explained, the invention is based on the demonstration that the noise denoising / degradation compromise is based on a spectral gain calculation (a function of a minimum gain scalar parameter and a probability of presence of speech) whose model is suboptimal, and proposes a formula involving a temporal modulation of these elements of calculation of the spectral gain, which become a function of relevant temporal descriptors of the noisy speech signal.

L'invention repose sur l'exploitation d'une donnée globale pour traiter de manière plus pertinente et adaptée chaque bande de fréquence, la dureté de débruitage étant rendue variable en fonction de la présence de parole sur une trame (on débruite plus quand le risque d'avoir une contrepartie est faible).The invention is based on the exploitation of a global datum in order to process each frequency band in a more relevant and adapted manner, the denoising hardness being made variable as a function of the presence of speech on a frame (the noise is no longer disconnected when the risk to have a counterpart is weak).

Dans l'algorithme OM-LSA conventionnel, chaque bande de fréquence est traitée de manière indépendante, et pour une fréquence donnée on n'intègre pas la connaissance a priori des autres bandes. Or, une analyse plus large qui observe l'ensemble de la trame pour calculer un indicateur global caractéristique de la trame (ici, un indicateur de présence de parole capable de discriminer même grossièrement phase de bruit seul et phase de parole) est un moyen utile et efficace pour affiner le traitement à l'échelle de la bande de fréquences.In the conventional OM-LSA algorithm, each frequency band is treated independently, and for a given frequency is not integrated a priori knowledge of other bands. However, a broader analysis that observes the entire frame to calculate a global indicator characteristic of the frame (here, a speech presence indicator capable of discriminating even roughly noise phase alone and speech phase) is a useful means and effective in refining the processing at the frequency band scale.

Concrètement, dans un algorithme OM-LSA conventionnel, le gain de débruitage est généralement ajusté à une valeur de compromis, typiquement de l'ordre de 14 dB.Specifically, in a conventional OM-LSA algorithm, the denoising gain is generally adjusted to a compromise value, typically of the order of 14 dB.

La mise en oeuvre de l'invention permet d'ajuster ce gain dynamiquement à une valeur variant entre 8 dB (en présence de parole) et 17 dB (en présence de bruit seul). La réduction de bruit est ainsi beaucoup plus énergique, et rend le bruit pratiquement imperceptible (et en tout état de cause non musical) en l'absence de parole dans la majeure partie des situations couramment rencontrées. Et même en présence de parole, le débruitage ne modifie pas la tonalité de la voix, dont le rendu reste naturel.The implementation of the invention makes it possible to adjust this gain dynamically to a value varying between 8 dB (in the presence of speech) and 17 dB (in the presence of noise alone). The noise reduction is thus much more energetic, and makes the noise virtually imperceptible (and in any case non-musical) in the absence of speech in most situations commonly encountered. And even in the presence of speech, denoising does not change the tone of the voice, whose rendering remains natural.

Claims

A method for denoising an audio signal by applying an algorithm with a variable spectral gain, which is function of a speech presence probability, including the following successive steps:
a) generating (10) successive time frames (y(k)) of the digitized noisy audio signal (y(n)) ;

b) applying a Fourier transform (12) to the frames generated at step a), so as to produce for each signal time frame a signal spectrum (Y(k, l)) with a plurality of predetermined frequency bands;

c) in the frequency domain:
c1) estimating (18), for each frequency band of each current time frame, a speech presence probability (p(k, l)) ;

c3) calculating (16) a spectral gain (G_omLsA(k,l)), specific to each frequency band of each current time frame, as a function of: i) an estimation of the noise energy in each frequency band, ii) the speech presence probability estimated at step c1), and iii) a scalar value of minimum gain (G_min) representative of a parameter of hardness of the denoising;

c4) performing (14) a selective noise reduction by applying to each frequency band of the gain calculated at step c3);

d) applying an inverse Fourier transform (20) to the signal spectrum (X̂(k,l)) consisted of the frequency bands produced at step c4), so as to deliver for each spectrum a time frame of denoised signal; and

e) reconstructing (22) a denoised audio signal from the time frames delivered at step d),
the method being characterized in that:
- said scalar value of minimum gain (G_min) is a value (G_min(k)) that is dynamically modulatable at each successive time frame (y(k)) ; and

- the method further includes, previously to step c3) of calculating the spectral gain, a step of:
c2) calculating (24), for the current time frame (y(k)), said modulatable value (G_min(k)), as a function of a global variable (SNR_y(k)); (P_speech(k)); (VAD(k)) observed at the current time frame for all the frequency bands; and

- said calculation of step c2) comprises apply, for the current time frame, an increment/decrement (ΔG_min(k); Δ ₁ G_min ,Δ₂ G_min; ΔG_min) added to a parameterized nominal value (G_min) of said minimum gain.
The method of claim 1, wherein said global variable is a signal-to-noise ratio (SNR_y(k)) of the current time frame, evaluated (26) in the time domain.
The method of claim 2, wherein the scalar value of minimum gain is calculated at step c2) by application of the relation: $G_{\min} (k) = G_{\min} + Δ G_{\min} ({SNR}_{y} (k))$

k being the index of the current time frame,

G_min(k) being the minimum gain to be applied to the current time frame,

G_min being said parameterized nominal value of the minimum gain,

ΔG_min(k) being said increment/decrement added to G_min , and

SNR_y(k) being the signal-to-noise ratio of the current time frame.
The method of claim 1, wherein said global variable is an average speech probability (P_speech(k)), evaluated (28) at the current time frame.
The method of claim 4, wherein the scalar value of minimum gain is calculated at step c2) by application of the relation: $G_{\min} (k) = G_{\min} + (P_{speech} (k) - 1) . Δ_{1} G_{\min} + P_{speech} (k) . Δ_{2} G_{\min}$

k being the index of the current time frame,

G_min(k) being the minimum gain to be applied to the current time frame,

G_min being said parameterized nominal value of the minimum gain,

P_speech(k) being the average speech probability evaluated at the current time frame,

Δ₁ G_min being said increment/decrement added to G_min in phase of noise, and

Δ ₂ G_min being said increment/decrement added to G_min in phase of speech.
The method of claim 4, wherein the average speech probability is evaluated at the current time frame by application of the relation: $P_{speech} (k) = \frac{1}{N} \sum_{l}^{N} p (k, l)$

l being the index of the frequency band,

N being the number of frequency bands in the spectrum, and p(k,l) being the speech presence probability of the frequency band of index l of the current time frame.
The method of claim 1, wherein said global variable is a Boolean signal of voice activity detection (VAD(k)) for the current time frame, evaluated (30) in the time domain by analysis of the time frame and/or by means of an external detector.
The method of claim 7, wherein the scalar value of minimum gain is calculated at step c2) by application of the relation: $G_{\min} (k) = G_{\min} + VAD (k) . Δ G_{\min}$

k being the index of the current time frame,

G_min(k) being the minimum gain to be applied to the current time frame,

G_min being said parameterized nominal value of the minimum gain,

VAD(k) being the value of the Boolean signal of voice activity detection for the current time frame, and

ΔG_min being said increment/decrement added to G_min .