EP1116216A1

EP1116216A1 - Method and device for detecting voice activity

Info

Publication number: EP1116216A1
Application number: EP00956596A
Authority: EP
Inventors: Stéphane LUBIARZ; Edouard Hinard; François CAPMAN; Philip Lockwood
Original assignee: Matra Nortel Communications SAS
Current assignee: Nortel Networks France SAS
Priority date: 1999-08-04
Filing date: 2000-08-02
Publication date: 2001-07-18
Also published as: WO2001011605A1; AU6848400A; FR2797343B1; FR2797343A1; US7003452B1

Abstract

The invention concerns a method for detecting voice activity in a digital speech signal, in at least one frequency band, for example by means of a detection automaton whose state is controlled on the basis of an energy analysis of the signal. The control of this automaton, or more generally the determination of voice activity, comprises a comparison, in the frequency band, of two different versions of the speech signal, at least one of which is a denoised version.

Description

VOICE ACTIVITY DETECTION METHOD AND DEVICE

The present invention relates to digital techniques for processing speech signals. It relates more particularly to techniques using voice activity detection in order to carry out differentiated processing depending on whether or not the signal carries voice activity. The digital techniques in question come from various fields: speech coding for transmission or storage, speech recognition, noise reduction, echo cancellation, etc.

The main difficulty with voice activity detection methods is the distinction between voice activity and the noise that accompanies the speech signal.

Document WO99/14737 describes a method for detecting voice activity in a digital speech signal processed by successive frames, in which a priori denoising of the speech signal of each frame is carried out on the basis of noise estimates obtained during the processing of one or more previous frames, and the energy variations of the a priori denoised signal are analyzed to detect a degree of voice activity in the frame. Detecting voice activity on the basis of an a priori denoised signal appreciably improves the performance of this detection when the surrounding noise is relatively high.

In the methods usually used to detect voice activity, the variations in signal energy (direct or denoised) are analyzed in relation to a long-term average of the energy of this signal, a relative increase in instantaneous energy suggesting the appearance of voice activity.

An object of the present invention is to propose another type of analysis allowing detection of vocal activity robust to the noise which can accompany the speech signal.

According to the invention, there is proposed a method for detecting voice activity in a digital speech signal in at least one frequency band, according to which the voice activity is detected on the basis of an analysis comprising a comparison, in said frequency band, of two different versions of the speech signal, at least one of which is a denoised version obtained by taking into account estimates of the noise included in the signal. This process can be performed over the entire frequency band of the signal, or in sub-bands, depending on the needs of the application using the voice activity detection.

Voice activity can be detected in binary fashion for each band, or measured by a continuously varying parameter that can result from the comparison of the two different versions of the speech signal.

The comparison typically relates to respective energies, evaluated in said frequency band, of the two different versions of the speech signal, or to a monotonic function of these energies.

Another aspect of the present invention relates to a device for detecting voice activity in a speech signal, comprising signal processing means arranged to implement a method as defined above.

The invention also relates to a computer program, loadable into a memory associated with a processor, and comprising portions of code for implementing a method as defined above during the execution of said program by the processor, as well as to a computer medium on which such a program is recorded.

Other features and advantages of the present invention will appear in the following description of nonlimiting exemplary embodiments, with reference to the appended drawings, in which:

- Figure 1 is a block diagram of a signal processing chain using a voice activity detector according to the invention;

- Figure 2 is a block diagram of an example of voice activity detector according to the invention;

- Figures 3 and 4 are flow charts of signal processing operations carried out in the detector of Figure 2;

- Figure 5 is a graph showing an example of evolution of energies calculated in the detector of Figure 2 and illustrating the principle of voice activity detection;

- Figure 6 is a diagram of a detection automaton implemented in the detector of Figure 2;

- Figure 7 is a block diagram of another embodiment of a voice activity detector according to the invention;

- Figure 8 is a flow chart of signal processing operations performed in the detector of Figure 7;

- Figure 9 is a graph of a function used in the operations of Figure 8.

The device of Figure 1 processes a digital speech signal s. The signal processing chain represented produces voice activity decisions $\delta_{n,j}$, usable in a manner known per se by application units (not shown) providing functions such as speech coding, speech recognition, noise reduction or echo cancellation. The decisions $\delta_{n,j}$ can include a frequency resolution (index j), which enriches applications operating in the frequency domain.

A windowing module 10 puts the signal s in the form of successive windows or frames of index n, each consisting of a number N of digital signal samples. Conventionally, these frames may overlap one another. In the following description, it will be considered, without this being limiting, that the frames consist of N = 256 samples at a sampling frequency $F_e$ of 8 kHz, with Hamming weighting in each window and 50% overlap between consecutive windows.
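The framing performed by the windowing module 10 can be illustrated by the following minimal sketch. It is not taken from the appendices: the function name `make_frames`, the floating-point representation and the classical Hamming coefficients 0.54 and 0.46 are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the patent): cuts the signal s into
// Hamming-weighted frames of frame_len samples with 50% overlap.
std::vector<std::vector<double>> make_frames(const std::vector<double>& s,
                                             std::size_t frame_len = 256) {
    assert(frame_len >= 2);
    const double pi = std::acos(-1.0);
    const std::size_t hop = frame_len / 2;  // 50% overlap between frames
    std::vector<std::vector<double>> frames;
    for (std::size_t start = 0; start + frame_len <= s.size(); start += hop) {
        std::vector<double> frame(frame_len);
        for (std::size_t k = 0; k < frame_len; ++k) {
            // Assumed classical Hamming weighting.
            const double w = 0.54 - 0.46 * std::cos(
                2.0 * pi * static_cast<double>(k) /
                static_cast<double>(frame_len - 1));
            frame[k] = w * s[start + k];
        }
        frames.push_back(frame);
    }
    return frames;
}
```

With N = 256 and a hop of 128 samples, one second of signal at 8 kHz (8000 samples) yields 61 complete frames in this sketch; trailing samples that do not fill a frame are simply dropped.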

The signal frame is transformed into the frequency domain by a module 11 applying a conventional fast Fourier transform (FFT) algorithm to calculate the modulus of the signal spectrum. The module 11 then delivers a set of N = 256 frequency components of the speech signal, denoted $S_{n,f}$, where n denotes the number of the current frame and f a frequency of the discrete spectrum. Owing to the properties of digital signals in the frequency domain, only the first N/2 = 128 samples are used.

To calculate the estimates of the noise contained in the signal s, the frequency resolution available at the output of the fast Fourier transform is not used; a lower resolution is used instead, determined by a number I of frequency sub-bands covering the band $[0, F_e/2]$ of the signal. Each sub-band i ($1 \le i \le I$) extends between a lower frequency $f(i-1)$ and an upper frequency $f(i)$, with $f(0) = 0$ and $f(I) = F_e/2$. This division into sub-bands can be uniform ($f(i) - f(i-1) = F_e/(2I)$). It can also be non-uniform (for example according to a Bark scale). A module 12 calculates the respective averages of the spectral components $S_{n,f}$ of the speech signal by sub-bands, for example by a uniform weighting such that:

$$S_{n,i} = \frac{1}{f(i) - f(i-1)} \sum_{f \in [f(i-1),\, f(i)[} S_{n,f}$$

This averaging decreases the fluctuations between the sub-bands by averaging the contributions of the noise in these sub-bands, which will decrease the variance of the noise estimator. In addition, this averaging makes it possible to reduce the complexity of the system.
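As an illustration of the averaging performed by the module 12, the following sketch computes the mean of the spectral moduli over each sub-band. The function name `subband_means` and the representation of the sub-band limits as FFT bin indices are assumptions, not elements of the appendices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: edges[i-1] is the first FFT bin of sub-band i and
// edges[i] is one past its last bin, mirroring the interval [f(i-1), f(i)[.
std::vector<double> subband_means(const std::vector<double>& spectrum,
                                  const std::vector<std::size_t>& edges) {
    assert(edges.size() >= 2 && edges.back() <= spectrum.size());
    std::vector<double> means;
    for (std::size_t i = 1; i < edges.size(); ++i) {
        assert(edges[i] > edges[i - 1]);
        double sum = 0.0;
        for (std::size_t f = edges[i - 1]; f < edges[i]; ++f) {
            sum += spectrum[f];
        }
        // Uniform weighting: divide by the width of the sub-band.
        means.push_back(sum / static_cast<double>(edges[i] - edges[i - 1]));
    }
    return means;
}
```

For example, with the 128 useful bins and I = 16 uniform sub-bands, `edges` would hold 0, 8, 16, ..., 128.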

The averaged spectral components $S_{n,i}$ are addressed to a voice activity detection module 15 and to a noise estimation module 16. We denote by $\hat{B}_{n,i}$ the long-term estimate of the noise component produced by the module 16 for frame n and sub-band i.

These long-term estimates $\hat{B}_{n,i}$ can for example be obtained in the manner described in WO99/14737. One can also use a simple smoothing by means of an exponential window defined by a forgetting factor $\lambda_B$:

$$\hat{B}_{n,i} = \lambda_B \cdot \hat{B}_{n-1,i} + (1 - \lambda_B) \cdot S_{n,i}$$

with $\lambda_B$ equal to 1 if the voice activity detector 15 indicates that sub-band i carries voice activity, and equal to a value between 0 and 1 otherwise.
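The exponential smoothing of the noise estimate can be sketched as follows. The function name `update_noise_estimate` and the default value 0.9 of the forgetting factor used outside speech are assumptions chosen for illustration; the description only requires a value between 0 and 1.

```cpp
#include <cassert>

// Hypothetical helper: one update of the long-term noise estimate in a
// sub-band. b_prev is the previous estimate, s the current averaged
// spectral component, lambda_idle the assumed forgetting factor used when
// no voice activity is detected.
double update_noise_estimate(double b_prev, double s, bool voice_active,
                             double lambda_idle = 0.9) {
    assert(lambda_idle > 0.0 && lambda_idle < 1.0);
    // A forgetting factor of 1 freezes the estimate during voice activity.
    const double lambda_b = voice_active ? 1.0 : lambda_idle;
    return lambda_b * b_prev + (1.0 - lambda_b) * s;
}
```

Freezing the estimate while speech is detected prevents the speech energy from leaking into the noise estimate.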

Of course, it is possible to use other long-term estimates representative of the noise component included in the speech signal. These estimates can represent a long-term average, or even a minimum of the component $S_{n,i}$ over a sufficiently long sliding window.

Figures 2 to 6 illustrate a first embodiment of the voice activity detector 15. A denoising module 18 performs, for each frame n and each sub-band i, the operations corresponding to steps 180 to 187 of Figure 3, to produce two denoised versions $\hat{E}_{p1,n,i}$ and $\hat{E}_{p2,n,i}$ of the speech signal. This denoising is operated by non-linear spectral subtraction. The first version $\hat{E}_{p1,n,i}$ is denoised so as not to be less, in the spectral domain, than a fraction $\beta1_i$ of the long-term estimate $\hat{B}_{n-\tau1,i}$. The second version $\hat{E}_{p2,n,i}$ is denoised so as not to be less, in the spectral domain, than a fraction $\beta2_i$ of the long-term estimate $\hat{B}_{n-\tau1,i}$. The quantity τ1 is a delay expressed as a number of frames, which can be fixed (for example τ1 = 1) or variable. The more confident one is in the voice activity detection, the smaller this delay is. The fractions $\beta1_i$ and $\beta2_i$ (such that $\beta1_i > \beta2_i$) can be dependent on or independent of the sub-band i. Preferred values correspond for $\beta1_i$ to an attenuation of 10 dB, and for $\beta2_i$ to an attenuation of 60 dB, i.e. $\beta1_i \approx 0.3$ and $\beta2_i \approx 0.001$.
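As a check on these preferred values, and assuming that the attenuations quoted are amplitude attenuations (the $20 \log_{10}$ convention), a fraction $\beta$ corresponding to an attenuation of $A$ dB is given by:

$$\beta = 10^{-A/20}, \qquad \beta1_i = 10^{-10/20} \approx 0.316 \approx 0.3, \qquad \beta2_i = 10^{-60/20} = 0.001$$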

In step 180, the module 18 calculates, with the resolution of the sub-bands i, the frequency response $H_{p,n,i}$ of the a priori denoising filter, according to:

$$H_{p,n,i} = \frac{S_{n,i} - \alpha'_{n-\tau1,i} \cdot \hat{B}_{n-\tau1,i}}{S_{n-\tau2,i}}$$

where τ2 is a positive or zero integer delay and $\alpha'_{n,i}$ is a noise overestimation coefficient. This overestimation coefficient $\alpha'_{n,i}$ can be dependent on or independent of the frame index n and/or of the sub-band index i. In a preferred embodiment, it depends on both n and i, and it is determined as described in document WO99/14737. A first denoising is carried out in step 181: $\hat{E}_{p,n,i} = H_{p,n,i} \cdot S_{n,i}$. In steps 182 to 184, the spectral components $\hat{E}_{p1,n,i}$ are calculated according to

$$\hat{E}_{p1,n,i} = \max\left(\hat{E}_{p,n,i};\ \beta1_i \cdot \hat{B}_{n-\tau1,i}\right)$$

and in steps 185 to 187, the spectral components $\hat{E}_{p2,n,i}$ are calculated according to

$$\hat{E}_{p2,n,i} = \max\left(\hat{E}_{p,n,i};\ \beta2_i \cdot \hat{B}_{n-\tau1,i}\right)$$

The voice activity detector 15 of Figure 2 comprises a module 19 which calculates the energies of the denoised versions $\hat{E}_{p1,n,i}$ and $\hat{E}_{p2,n,i}$ of the signal, respectively contained in m frequency bands designated by the index j ($1 \le j \le m$, $m \ge 1$). This resolution can be the same as that of the sub-bands defined by the module 12 (index i), or a coarser resolution which can go as far as the whole of the useful band $[0, F_e/2]$ of the signal (case m = 1). For example, the module 12 can define I = 16 uniform sub-bands of the band $[0, F_e/2]$, and the module 19 can keep m = 3 wider bands, each band of index j covering the sub-bands of index i going from imin(j) to imax(j), with imin(1) = 1, imin(j+1) = imax(j) + 1 for $1 \le j < m$, and imax(m) = I. In step 190 (Figure 3), the module 19 calculates the energies per band:

$$E_{1,n,j} = \sum_{i=imin(j)}^{imax(j)} \left[f(i) - f(i-1)\right] \cdot \hat{E}_{p1,n,i}^{2}$$

$$E_{2,n,j} = \sum_{i=imin(j)}^{imax(j)} \left[f(i) - f(i-1)\right] \cdot \hat{E}_{p2,n,i}^{2}$$
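The band energies of step 190 amount to a width-weighted sum of squared components, as in the following sketch. The function name `band_energy` and the zero-based, inclusive index convention are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: energy of one band j.
// ep[i]    : denoised component of sub-band i (either version)
// width[i] : width f(i) - f(i-1) of sub-band i
// imin, imax : zero-based, inclusive limits of the band
double band_energy(const std::vector<double>& ep,
                   const std::vector<double>& width,
                   std::size_t imin, std::size_t imax) {
    assert(ep.size() == width.size());
    assert(imin <= imax && imax < ep.size());
    double energy = 0.0;
    for (std::size_t i = imin; i <= imax; ++i) {
        energy += width[i] * ep[i] * ep[i];
    }
    return energy;
}
```

Calling the function once with the first denoised version and once with the second gives the two energies of the band.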

A module 20 of the voice activity detector 15 performs a time smoothing of the energies $E_{1,n,j}$ and $E_{2,n,j}$ for each of the bands of index j, which corresponds to steps 200 to 205 of Figure 4. The smoothing of these two energies is carried out by means of a smoothing window determined by comparing the energy $E_{2,n,j}$ of the most denoised version with its previously calculated smoothed energy $\bar{E}_{2,n-1,j}$, or with a value of the order of this smoothed energy (tests 200 and 201). This smoothing window can be an exponential window defined by a forgetting factor λ between 0 and 1. This forgetting factor λ can take three values:

- a first value $\lambda_r$ very close to 0 (for example $\lambda_r = 0$), chosen in step 202 if $E_{2,n,j} \le \bar{E}_{2,n-1,j}$;

- a second value $\lambda_q$ very close to 1 (for example $\lambda_q = 0.99999$), chosen in step 203 if $E_{2,n,j} > \Delta \cdot \bar{E}_{2,n-1,j}$, Δ being a coefficient greater than 1;

- a third value $\lambda_p$ between 0 and $\lambda_q$ (for example $\lambda_p = 0.98$), chosen in step 204 if $\bar{E}_{2,n-1,j} < E_{2,n,j} < \Delta \cdot \bar{E}_{2,n-1,j}$.

The exponential smoothing with the forgetting factor λ is then conventionally carried out in step 205 according to:

$$\bar{E}_{1,n,j} = \lambda \cdot \bar{E}_{1,n-1,j} + (1 - \lambda) \cdot E_{1,n,j}$$

$$\bar{E}_{2,n,j} = \lambda \cdot \bar{E}_{2,n-1,j} + (1 - \lambda) \cdot E_{2,n,j}$$

An example of the variation over time of the energies $E_{1,n,j}$, $E_{2,n,j}$ and of the smoothed energies $\bar{E}_{1,n,j}$ and $\bar{E}_{2,n,j}$ is shown in Figure 5. It can be seen that the smoothed energies are tracked well when the forgetting factor is determined on the basis of the variations of the energy $E_{2,n,j}$ corresponding to the most denoised version of the signal. The forgetting factor $\lambda_p$ takes into account increases in the level of background noise, the energy decreases being followed by the forgetting factor $\lambda_r$. The forgetting factor $\lambda_q$ very close to 1 means that the smoothed energies do not follow the sudden energy increases due to speech. The factor $\lambda_q$ remains, however, slightly less than 1 to avoid errors caused by an increase in background noise which can occur during a fairly long period of speech.
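The choice of the forgetting factor and the smoothing of steps 200 to 205 can be sketched as follows, in floating point and with the example values quoted in the description as defaults. The names `pick_lambda`, `SmoothedPair` and `smooth_energies` are hypothetical, and the case where the energy equals Δ times the smoothed energy is arbitrarily assigned to $\lambda_p$ in this sketch.

```cpp
#include <cassert>

// Hypothetical helper: selects the forgetting factor from the energy e2 of
// the most denoised version and its previous smoothed value.
double pick_lambda(double e2, double e2_smoothed_prev, double delta = 4.953,
                   double lambda_r = 0.0, double lambda_p = 0.98,
                   double lambda_q = 0.99999) {
    if (e2 <= e2_smoothed_prev) return lambda_r;          // step 202
    if (e2 > delta * e2_smoothed_prev) return lambda_q;   // step 203
    return lambda_p;                                      // step 204
}

// Hypothetical container for the two smoothed energies of one band.
struct SmoothedPair {
    double e1;
    double e2;
};

// Step 205: both energies are smoothed with the same forgetting factor.
SmoothedPair smooth_energies(SmoothedPair prev, double e1, double e2) {
    const double lambda = pick_lambda(e2, prev.e2);
    return {lambda * prev.e1 + (1.0 - lambda) * e1,
            lambda * prev.e2 + (1.0 - lambda) * e2};
}
```

With $\lambda_r = 0$, a drop in energy is followed immediately, whereas a sudden rise by more than the factor Δ barely moves the smoothed values.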

The voice activity detection automaton is controlled in particular by a parameter resulting from a comparison of the energies $E_{1,n,j}$ and $E_{2,n,j}$. This parameter can in particular be the ratio $d_{n,j} = E_{1,n,j} / E_{2,n,j}$. It can be seen in Figure 5 that this ratio $d_{n,j}$ makes it possible to detect the speech phases correctly (represented by hatching).
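A short worked example shows why this ratio separates noise from speech. Assume, for simplicity, fractions $\beta1$ and $\beta2$ independent of the sub-band and take the preferred values 0.3 and 0.001. In a band carrying only noise, the a priori denoised components fall below both floors, so that each version equals its floor and, with $w_i = f(i) - f(i-1)$:

$$d_{n,j} = \frac{\sum_i w_i\,(\beta1 \cdot \hat{B}_{n-\tau1,i})^2}{\sum_i w_i\,(\beta2 \cdot \hat{B}_{n-\tau1,i})^2} = \left(\frac{\beta1}{\beta2}\right)^2 = \left(\frac{0.3}{0.001}\right)^2 = 9 \times 10^4$$

When speech dominates the band, neither floor is reached, both versions are equal and $d_{n,j} = 1$. A threshold slightly above 1 therefore distinguishes the two situations.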

The control of the detection automaton can also use other parameters, such as a parameter linked to the signal-to-noise ratio: $snr_{n,j} = E_{1,n,j} / \bar{E}_{1,n,j}$, which amounts to taking into account a comparison between the energies $E_{1,n,j}$ and $\bar{E}_{1,n,j}$. The module 21 for controlling the automata relating to the different bands of index j calculates the parameters $d_{n,j}$ and $snr_{n,j}$ in step 210, then determines the state of the automata. The new state $\delta_{n,j}$ of the automaton relating to band j depends on the previous state $\delta_{n-1,j}$, on $d_{n,j}$ and on $snr_{n,j}$, for example as shown in the diagram of Figure 6.

Four states are possible: $\delta_j = 0$ denotes silence, or absence of speech; $\delta_j = 2$ denotes the presence of voice activity; and the states $\delta_j = 1$ and $\delta_j = 3$ are intermediate states of ascent and descent. When the automaton is in the silence state ($\delta_{n-1,j} = 0$), it remains there if $d_{n,j}$ exceeds a first threshold $\alpha1_j$, and it goes into the ascent state otherwise. In the ascent state ($\delta_{n-1,j} = 1$), it returns to the silence state if $d_{n,j}$ exceeds a second threshold $\alpha2_j$, and it goes into the speech state otherwise. When the automaton is in the speech state ($\delta_{n-1,j} = 2$), it remains there if $snr_{n,j}$ exceeds a third threshold $\alpha3_j$, and it goes into the descent state otherwise. In the descent state ($\delta_{n-1,j} = 3$), the automaton returns to the speech state if $snr_{n,j}$ exceeds a fourth threshold $\alpha4_j$, and it returns to the silence state otherwise. The thresholds $\alpha1_j$, $\alpha2_j$, $\alpha3_j$ and $\alpha4_j$ can be optimized separately for each of the frequency bands j. It is also possible for the module 21 to make the automata relating to different bands interact.

In particular, it can force the automata relating to each of the sub-bands into the speech state as soon as one of them is in the speech state. In this case, the output of the voice activity detector 15 concerns the entire signal band.
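The four-state automaton of one band can be sketched as a pure transition function. The names `VadState` and `next_state` are hypothetical, the parameters d and snr are taken as linear ratios (Appendix 2 works with logarithms instead), and the default thresholds 1.221 and 1.649 are illustrative values in line with the example parameters of the appendices.

```cpp
#include <cassert>

// State numbering follows the description: 0 silence, 1 ascent,
// 2 speech, 3 descent.
enum class VadState { Silence = 0, Ascent = 1, Speech = 2, Descent = 3 };

// Hypothetical helper: next state of the automaton of one band.
// d   : ratio of the energies of the two denoised versions
// snr : ratio of the energy of the first version to its smoothed energy
VadState next_state(VadState prev, double d, double snr,
                    double a1 = 1.221, double a2 = 1.221,
                    double a3 = 1.649, double a4 = 1.221) {
    switch (prev) {
        case VadState::Silence:
            return d > a1 ? VadState::Silence : VadState::Ascent;
        case VadState::Ascent:
            return d > a2 ? VadState::Silence : VadState::Speech;
        case VadState::Speech:
            return snr > a3 ? VadState::Speech : VadState::Descent;
        case VadState::Descent:
            return snr > a4 ? VadState::Speech : VadState::Silence;
    }
    return prev;  // not reached for valid states
}
```

The intermediate states mean that two consecutive frames must agree before the automaton enters or leaves the speech state, which acts as a simple hysteresis.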

The two appendices to the present description show source code in the C++ language, with a fixed-point representation of the data, corresponding to an implementation of the example of voice activity detection method described above. To produce the detector, one possibility is to translate this source code into executable code, to store it in a program memory associated with an appropriate signal processor, and to have it executed by this processor on the input signals of the detector. The function a_priori_signal_power presented in appendix 1 corresponds to the operations incumbent on the modules 18 and 19 of the voice activity detector 15 of Figure 2. The function voice_activity_detector presented in appendix 2 corresponds to the operations incumbent on the modules 20 and 21 of this detector.

In the particular example of the appendices, the following parameters have been used: τ1 = 1; τ2 = 0; $\beta1_i = 0.3$; $\beta2_i = 0.001$; m = 3; Δ = 4.953; $\lambda_p = 0.98$; $\lambda_q = 0.99999$; $\lambda_r = 0$; $\alpha1_j = \alpha2_j = \alpha4_j = 1.221$; $\alpha3_j = 1.649$.

Table I below gives the correspondences between the notations used in the previous description and in the drawings and those used in the appendix.

TABLE I

In the variant embodiment illustrated in Figure 7, the denoising module 25 of the voice activity detector 15 delivers a single denoised version $\hat{E}_{p,n,i}$ of the speech signal, for which the module 26 calculates the energy $E_{2,n,j}$ for each band j. The other version whose energy the module 26 calculates is directly represented by the non-denoised samples $S_{n,i}$.

As before, various denoising methods can be applied by the module 25. In the example illustrated by steps 250 to 256 of Figure 8, the denoising is operated by non-linear spectral subtraction with a noise overestimation coefficient dependent on a quantity ρ related to the signal-to-noise ratio. In steps 250 to 252, a preliminary denoising is carried out for each sub-band of index i according to:

$$S'_{n,i} = \max\left(S_{n,i} - \alpha \cdot \hat{B}_{n-1,i};\ \beta \cdot \hat{B}_{n-1,i}\right)$$

the preliminary overestimation coefficient being for example α = 2, and the fraction β possibly corresponding to a noise attenuation of the order of 10 dB. The quantity ρ is taken equal to the ratio $S'_{n,i} / S_{n,i}$ in step 253. The overestimation factor f(ρ) varies non-linearly with the quantity ρ, for example as shown in Figure 9. For the values of ρ closest to 0 ($\rho < \rho_1$), the signal-to-noise ratio is low, and an overestimation factor f(ρ) = 2 can be taken. For the highest values of ρ ($\rho_2 < \rho < 1$), the noise is low and does not need to be overestimated (f(ρ) = 1). Between $\rho_1$ and $\rho_2$, f(ρ) decreases from 2 to 1, for example linearly. The actual denoising, providing the version $\hat{E}_{p,n,i}$, is carried out in steps 254 to 256:

$$\hat{E}_{p,n,i} = \max\left(S_{n,i} - f(\rho) \cdot \hat{B}_{n-1,i};\ \beta \cdot \hat{B}_{n-1,i}\right)$$

The voice activity detector 15 considered with reference to Figure 7 uses, in each frequency band of index j (and/or in full band), a two-state detection automaton, silence or speech. The energies $E_{1,n,j}$ and $E_{2,n,j}$ calculated by the module 26 are respectively those contained in the components $S_{n,i}$ of the speech signal and those contained in the denoised components $\hat{E}_{p,n,i}$, calculated over the different bands as indicated in step 260 of Figure 8. The comparison of the two different versions of the speech signal relates to respective differences between the energies $E_{1,n,j}$ and $E_{2,n,j}$ and a lower bound of the energy $E_{2,n,j}$ of the denoised version.

This lower bound $E_{2min,j}$ can in particular correspond to a minimum value, over a sliding window, of the energy $E_{2,n,j}$ of the denoised version of the speech signal in the frequency band considered. In this case, a module 27 stores in a first-in-first-out (FIFO) memory the L most recent values of the energy $E_{2,n,j}$ of the denoised signal in each band j, over a sliding window representing for example of the order of 20 frames, and delivers the minimum energies over this window (step 270 of Figure 8):

$$E_{2min,j} = \min_{0 \le k < L} E_{2,n-k,j}$$

In each band, this minimum energy $E_{2min,j}$ serves as a lower bound for the module 28 controlling the detection automaton, which uses a measurement $M_j$ formed from the differences $E_{1,n,j} - E_{2min,j}$ and $E_{2,n,j} - E_{2min,j}$ (step 280).

The automaton can be a simple binary automaton using a threshold $A_j$, possibly dependent on the band considered: if $M_j > A_j$, the output bit $\delta_{n,j}$ of the detector represents a state of silence for band j, and if $M_j < A_j$, it represents a state of speech. As a variant, the module 28 could deliver a non-binary measurement of the voice activity, represented by a decreasing function of $M_j$.

Alternatively, the lower bound $E_{2min,j}$ used in step 280 could be calculated using an exponential window, with a forgetting factor. It could also be represented by the energy over band j of the quantity $\beta \cdot \hat{B}_{n-1,i}$ serving as a floor in the denoising by spectral subtraction.
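The sliding minimum of step 270 and the binary decision can be sketched as follows. The class `SlidingMin` and the functions `measure` and `is_silence` are hypothetical. In particular, the exact expression of the measurement is not legible in this text; the sketch assumes that it is the ratio of the two differences to the lower bound, and adds a small guard `eps` against a zero denominator that is not part of the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical FIFO holding the most recent energies of the denoised
// version in one band and returning their minimum.
class SlidingMin {
public:
    explicit SlidingMin(std::size_t length) : length_(length) {
        assert(length > 0);
    }
    // Stores the new energy, discards the oldest one when the window is
    // full, and returns the minimum over the window.
    double push(double e2) {
        if (window_.size() == length_) window_.pop_front();
        window_.push_back(e2);
        return *std::min_element(window_.begin(), window_.end());
    }
private:
    std::size_t length_;
    std::deque<double> window_;
};

// Assumed form of the measurement: ratio of the differences between each
// energy and the lower bound (eps is an added guard, not from the text).
double measure(double e1, double e2, double e2min, double eps = 1e-12) {
    return (e1 - e2min) / std::max(e2 - e2min, eps);
}

// Binary decision: a measurement above the threshold denotes silence.
bool is_silence(double m, double threshold) { return m > threshold; }
```

Under this assumption, the measurement becomes very large when the denoised energy sits near its recent minimum (noise only) and approaches 1 when both energies are far above it (speech).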

In the foregoing, the analysis carried out to decide on the presence or absence of voice activity relates directly to energies of different versions of the speech signal. Of course, the comparisons could relate to a monotonic function of these energies, for example a logarithm, or to a quantity behaving analogously to the energies with respect to voice activity (for example the power).

ANNEX 1

/*************************************************************************
 * description
 *
 * NSS module:
 * signal power before VAD
 *
 *************************************************************************/

/*
 * included files
 */
#include <assert.h>
#include "private.h"

/*
 * private
 */
Word32 power(Word16 module, Word16 beta, Word16 thd, Word16 val);

/*
 * a_priori_signal_power
 */
void a_priori_signal_power
(
    /* IN */       Word16 *E, Word16 *internal_state, Word16 *max_noise,
                   Word16 *long_term_noise, Word16 *frequential_scale,
    /* IN & OUT */ Word16 *alpha,
    /* OUT */      Word32 *P1, Word32 *P2
)
{
    int vad;

    for (vad = 0; vad < param.vad_number; vad++) {
        int start = param.vads[vad].first_subband_for_power;
        int stop = param.vads[vad].last_subband;
        int subband;
        int uniform_subband;

        uniform_subband = 1;
        for (subband = start; subband <= stop; subband++)
            if (param.subband_size[subband] != param.subband_size[start])
                uniform_subband = 0;

        P1[vad] = 0; move32();
        P2[vad] = 0; move32();

        test();
        if (sub(internal_state[vad], NOISE) == 0) {
            for (subband = start; subband <= stop; subband++) {
                Word32 pwr;
                Word16 shift;
                Word16 module;
                Word16 alpha_long_term;

                alpha_long_term = shr(max_noise[subband], 2); move16();

                test(); test();
                if (sub(alpha_long_term, long_term_noise[subband]) >= 0) {
                    alpha[subband] = 0x7fff; move16();
                    alpha_long_term = long_term_noise[subband]; move16();
                } else if (sub(max_noise[subband], long_term_noise[subband]) < 0) {
                    alpha[subband] = 0x2000; move16();
                    alpha_long_term = shr(long_term_noise[subband], 2); move16();
                } else {
                    alpha[subband] = div_s(alpha_long_term, long_term_noise[subband]); move16();
                }

                module = sub(E[subband], shl(alpha_long_term, 2)); move16();

                if (uniform_subband) {
                    shift = shl(frequential_scale[subband], 1); move16();
                } else {
                    shift = add(param.subband_shift[subband], shl(frequential_scale[subband], 1)); move16();
                }

                pwr = power(module, param.beta_a_priori1, long_term_noise[subband], long_term_noise[subband]);
                pwr = L_shr(pwr, shift);
                P1[vad] = L_add(P1[vad], pwr); move32();

                pwr = power(module, param.beta_a_priori2, long_term_noise[subband], long_term_noise[subband]);
                pwr = L_shr(pwr, shift);
                P2[vad] = L_add(P2[vad], pwr); move32();
            }
        } else {
            for (subband = start; subband <= stop; subband++) {
                Word32 pwr;
                Word16 shift;
                Word16 module;
                Word16 alpha_long_term;

                alpha_long_term = mult(alpha[subband], long_term_noise[subband]); move16();
                module = sub(E[subband], shl(alpha_long_term, 2)); move16();

                if (uniform_subband) {
                    shift = shl(frequential_scale[subband], 1); move16();
                } else {
                    shift = add(param.subband_shift[subband], shl(frequential_scale[subband], 1)); move16();
                }

                pwr = power(module, param.beta_a_priori1, long_term_noise[subband], E[subband]);
                pwr = L_shr(pwr, shift);
                P1[vad] = L_add(P1[vad], pwr); move32();

                pwr = power(module, param.beta_a_priori2, long_term_noise[subband], E[subband]);
                pwr = L_shr(pwr, shift);
                P2[vad] = L_add(P2[vad], pwr); move32();
            }
        }
    }
}

/*
 * power
 */
Word32 power(Word16 module, Word16 beta, Word16 thd, Word16 val)
{
    Word32 power;

    test();
    if (sub(module, mult(beta, thd)) <= 0) {
        Word16 hi, lo;

        power = L_mult(val, val); move32();
        L_Extract(power, &hi, &lo);
        power = Mpy_32_16(hi, lo, beta); move32();
        L_Extract(power, &hi, &lo);
        power = Mpy_32_16(hi, lo, beta); move32();
    } else {
        power = L_mult(module, module); move32();
    }
    return (power);
}

ANNEX 2

/*************************************************************************
 * description
 *
 * NSS module:
 * VAD
 *
 *************************************************************************/

/*
 * included files
 */
#include <assert.h>
#include "private.h"
#include "simutool.h"

/*
 * private
 */
#define DELTA_P    (1.6 * 1024)
#define D_NOISE    (.2 * 1024)
#define D_SIGNAL   (.2 * 1024)
#define SNR_SIGNAL (.5 * 1024)
#define SNR_NOISE  (.2 * 1024)

/*
 * voice_activity_detector
 */
void voice_activity_detector
(
    /* IN */       Word32 *P1, Word32 *P2, Word16 frame_counter,
    /* IN & OUT */ Word32 *P1s, Word32 *P2s, Word16 *internal_state,
    /* OUT */      Word16 *state
)
{
    int vad;
    int signal;
    int noise;

    signal = 0; move16();
    noise = 1; move16();

    for (vad = 0; vad < param.vad_number; vad++) {
        Word16 snr, d;
        Word16 logP1, logP1s;
        Word16 logP2, logP2s;

        logP2 = logfix(P2[vad]); move16();
        logP2s = logfix(P2s[vad]); move16();

        /* smoothing of the two powers (module 20) */
        test();
        if (L_sub(P2[vad], P2s[vad]) > 0) {
            Word16 hi1, lo1;
            Word16 hi2, lo2;

            L_Extract(L_sub(P1[vad], P1s[vad]), &hi1, &lo1);
            L_Extract(L_sub(P2[vad], P2s[vad]), &hi2, &lo2);

            test();
            if (sub(sub(logP2, logP2s), DELTA_P) < 0) {
                P1s[vad] = L_add(P1s[vad], L_shr(Mpy_32_16(hi1, lo1, 0x6666), 4)); move32();
                P2s[vad] = L_add(P2s[vad], L_shr(Mpy_32_16(hi2, lo2, 0x6666), 4)); move32();
            } else {
                P1s[vad] = L_add(P1s[vad], L_shr(Mpy_32_16(hi1, lo1, 0x68db), 13)); move32();
                P2s[vad] = L_add(P2s[vad], L_shr(Mpy_32_16(hi2, lo2, 0x68db), 13)); move32();
            }
        } else {
            P1s[vad] = P1[vad]; move32();
            P2s[vad] = P2[vad]; move32();
        }

        logP1 = logfix(P1[vad]); move16();
        logP1s = logfix(P1s[vad]); move16();

        /* control parameters, computed as differences of logarithms */
        d = sub(logP1, logP2); move16();
        snr = sub(logP1, logP1s); move16();

        ProbeFix16("d", &d, 1, 1.);
        ProbeFix16("_snr", &snr, 1, 1.);

        Word16 pp;

        ProbeFix16("p1", &logP1, 1, 1.);
        ProbeFix16("p2", &logP2, 1, 1.);
        ProbeFix16("p1s", &logP1s, 1, 1.);
        ProbeFix16("p2s", &logP2s, 1, 1.);
        pp = logP2 - logP2s;
        ProbeFix16("dp", &pp, 1, 1.);

        /* state automaton of the band (module 21) */
        test();
        if (sub(internal_state[vad], NOISE) == 0) goto LABEL_NOISE;
        test();
        if (sub(internal_state[vad], ASCENT) == 0) goto LABEL_ASCENT;
        test();
        if (sub(internal_state[vad], SIGNAL) == 0) goto LABEL_SIGNAL;
        test();
        if (sub(internal_state[vad], DESCENT) == 0) goto LABEL_DESCENT;

    LABEL_NOISE:
        test();
        if (sub(d, D_NOISE) < 0) {
            internal_state[vad] = ASCENT; move16();
        }
        goto LABEL_END_VAD;

    LABEL_ASCENT:
        test();
        if (sub(d, D_SIGNAL) < 0) {
            internal_state[vad] = SIGNAL; move16();
            signal = 1; move16();
            noise = 0; move16();
        } else {
            internal_state[vad] = NOISE; move16();
        }
        goto LABEL_END_VAD;

    LABEL_SIGNAL:
        test();
        if (sub(snr, SNR_SIGNAL) < 0) {
            internal_state[vad] = DESCENT; move16();
        } else {
            signal = 1; move16();
        }
        noise = 0; move16();
        goto LABEL_END_VAD;

    LABEL_DESCENT:
        test();
        if (sub(snr, SNR_NOISE) < 0) {
            internal_state[vad] = NOISE; move16();
        } else {
            internal_state[vad] = SIGNAL; move16();
            signal = 1; move16();
            noise = 0; move16();
        }
        goto LABEL_END_VAD;

    LABEL_END_VAD:
        ; /* empty statement: a label must be followed by a statement */
    }

    /* global decision over all the bands */
    *state = TRANSITION; move16();

    test(); test();
    if (signal != 0) {
        test();
        if (sub(frame_counter, param.init_frame_number) >= 0) {
            for (vad = 0; vad < param.vad_number; vad++) {
                internal_state[vad] = SIGNAL; move16();
            }
            *state = SIGNAL; move16();
        }
    } else if (noise != 0) {
        *state = NOISE; move16();
    }
}

Claims

1. Method for detecting voice activity in a digital speech signal (s) in at least one frequency band, characterized in that the voice activity is detected on the basis of an analysis comprising a comparison, in said frequency band, of two different versions of the speech signal, at least one of which is a denoised version obtained by taking into account estimates of the noise included in the signal.

2. Method according to claim 1, in which said comparison relates to respective energies ($E_{1,n,j}$, $E_{2,n,j}$), evaluated in said frequency band, of the two different versions of the speech signal, or to a monotonic function of said energies.

3. Method according to claim 1 or 2, in which said analysis further comprises a time smoothing of the energy ($E_{1,n,j}$) of one of said versions of the speech signal, and a comparison between the energy of said version and the smoothed energy ($\bar{E}_{1,n,j}$).

4. Method according to claim 3, in which the comparison between the energy of said version ($E_{1,n,j}$) and the smoothed energy ($\bar{E}_{1,n,j}$) controls the transitions of a voice activity detection automaton from a speech state to a silence state, while the comparison of the two different versions of the speech signal controls the transitions of the detection automaton from the silence state to the speech state.

5. Method according to any one of claims 1 to 4, in which the two different versions of the speech signal are two versions denoised by non-linear spectral subtraction, a first of the two versions ($\hat{E}_{p1,n,i}$) being denoised so as not to be less, in the spectral domain, than a first fraction ($\beta1_i$) of a long-term estimate ($\hat{B}_{n,i}$) representative of a noise component included in the speech signal, and the second of the two versions ($\hat{E}_{p2,n,i}$) being denoised so as not to be less, in the spectral domain, than a second fraction ($\beta2_i$) of said long-term estimate, smaller than the first fraction.

6. Method according to claim 5, in which a time smoothing of the energy of each of the two versions of the speech signal is carried out, by means of a smoothing window determined by comparing the energy ($E_{2,n,j}$) of the second of the two versions with the smoothed energy ($\bar{E}_{2,n,j}$) of the second of the two versions.

7. Method according to claim 6, in which the smoothing window is an exponential window defined by a forgetting factor (λ).

8. Method according to claim 7, in which the forgetting factor (λ) has a value ($\lambda_r$) substantially zero when the energy ($E_{2,n,j}$) of the second of the two versions is less than a value of the order of the smoothed energy ($\bar{E}_{2,n,j}$) of the second of the two versions.

9. Method according to claim 8, in which the forgetting factor (λ) has a first value ($\lambda_q$) substantially equal to 1 when the energy ($E_{2,n,j}$) of the second of the two versions is greater than said value of the order of the smoothed energy multiplied by a coefficient (Δ) greater than 1, and a second value ($\lambda_p$) between 0 and said first value when the energy of the second of the two versions is greater than said value of the order of the smoothed energy and less than said value of the order of the smoothed energy multiplied by said coefficient.

10. Method according to any one of claims 5 to 9, in which the first and second fractions ($\beta1_i$, $\beta2_i$) correspond substantially to attenuations of 10 dB and 60 dB, respectively.

11. Method according to any one of claims 1 to 10, in which the comparison of the two different versions of the speech signal relates to respective differences between the energies ($E_{1,n,j}$, $E_{2,n,j}$) of these two versions in said frequency band and a lower bound ($E_{2min,j}$) of the energy ($E_{2,n,j}$) of the denoised version of the speech signal in said frequency band.

12. Method according to claim 11, in which one of the two different versions of the speech signal is a non-denoised version of the speech signal.

13. Device for detecting voice activity in a speech signal, comprising signal processing means (15) arranged to implement a method according to any one of claims 1 to 12.

14. Computer program, loadable in a memory associated with a processor, and comprising portions of code for the implementation of a method according to any one of claims 1 to 12 during the execution of said program by the processor.

15. Computer medium, on which a program according to claim 14 is recorded.