CN1175398C - Voice activity detection method for identifying speech and music in a noisy environment

Info

Publication number: CN1175398C
Application number: CNB001274945A
Authority: CN
Inventors: 黎家力
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2000-11-18
Filing date: 2000-11-18
Publication date: 2004-11-10
Anticipated expiration: 2020-11-18
Also published as: CN1354455A

Abstract

The present invention discloses a voice activity detection method for identifying speech and music in a noisy environment. The invention uses the signal-to-noise ratio (SNR) as the criterion for voice activity detection. The method comprises the following steps: the sampled data are converted to the frequency domain by FFT and divided non-linearly into subbands in the frequency domain; the energy and the SNR metric of each subband are calculated; the update of each subband's noise energy and the calculation of the SNR metric are carried out in a foreground process and a background process, with control alternating between the two; and the SNR metric is used as the criterion for distinguishing noise, speech and music. The invention can accurately detect speech and music in a noisy environment, giving the system strong resistance to environmental noise and strong adaptability to various valid sound signals.

Description

A voice activity detection method for identifying speech and music in a noisy environment

Technical field

The present invention relates to voice activity detection in digital communication systems, and more specifically to a voice activity detection (VAD) method that can accurately identify speech and music signals in an input signal mixed with environmental noise.

Background technology

Voice activity detection is widely used in communication systems. For example, in a mobile communication system it can increase the traffic-handling capacity of the system. As another example, in the audio mixing module of the multipoint control unit of a video conference, voice activity detection allows only those audio streams in which someone is detected speaking to take part in mixing, which reduces the number of terminals in the mix and improves mixing quality.

Conventional voice activity detection methods use short-term parameters of the speech signal, such as energy, zero-crossing rate or pitch period, as the basis for deciding whether someone is talking. When the background noise is high these methods produce misjudgments, and because all of these parameters are built on a model of human speech production, they are not suitable for music. In multimedia communication systems, however, music is frequently used as an important medium; conventional methods are applicable only to detecting human speech and do not cope with a non-stationary process such as music.

Summary of the invention

The object of the present invention is to provide a voice activity detection method that is suitable for noisy environments and can accurately detect speech and music, giving the system strong resistance to environmental noise while remaining highly adaptable to various valid sound signals. It is particularly suitable for multimedia communication systems such as video conferencing systems.

To achieve this object, the voice activity detection method for identifying speech and music in a noisy environment comprises the following steps:

1. The acquired sampled data are first converted to the frequency domain by the fast Fourier transform (FFT);

2. In the frequency domain, the spectrum is divided non-linearly into subbands; the energy and the foreground SNR of each subband are then calculated, and the foreground SNR metric is calculated from the foreground SNR;

3. If the current frame is the first frame, the current state is set to the foreground state;

4. The work of the foreground and the background is controlled according to the various statistics of the current SNR metric;

5. If the current state is the foreground state, the current foreground SNR metric is compared with the selected thresholds, and judgment and processing are carried out;

6. If the current state is the background state, the background subband noise energy update is started, the background SNR and the background SNR metric are calculated, and judgment and processing are carried out according to the statistics of the SNR metric;

7. If the current state is the transition state, transition-state processing is entered, and a further judgment based on the statistics of the SNR metric determines whether the foreground state or the background state is finally entered;

8. According to the requirement of the external module, either the foreground SNR metric is output, or the silence flag derived from the foreground SNR metric is output as the voice activity detection (VAD) control flag;

9. The total energy of the subbands of the current frame is calculated and output according to the requirement of the external module (this step is optional);

10. Return to step 1 and process the next frame.

In step 8 of the above voice activity detection method, if the foreground SNR metric is greater than threshold one, the sound flag is set; otherwise the silence flag is set.

As the above scheme shows, the voice activity detection method of the present invention is based on the signal-to-noise ratio, a physical quantity of general applicability. Compared with other methods it therefore has the clear advantage of wide adaptability: it can detect both speech and music, has strong noise immunity, is suitable for various noise environments, and can adapt to hardware with various input gains and different signal-to-noise ratios. It is particularly suitable for multimedia communication systems.

Description of drawings

The invention is further described below with reference to the drawings and embodiments.

Fig. 1 is a flowchart of the method of the invention.

Fig. 2 is a flowchart of the method as used in a system.

Embodiment

The method is described in detail below with reference to Fig. 1:

The present invention bases the voice activity detection criterion on the signal-to-noise ratio (SNR). Because the spectrum perceptible to the human ear is mainly concentrated below 4 kHz, and in order to reduce the amount of computation, the invention is described using 8 kHz sampling as an example; the method applies equally to other sampling rates provided some parameters are changed. In the first step, the acquired sampled data are converted to the frequency domain by the fast Fourier transform (FFT):

The input speech is denoted s(n). The frame length of the algorithm is 10 ms, i.e. 80 samples per frame (L = 80), and inter-frame overlap is used, with D = 24 overlapping samples. The input frame buffer d(m, n) therefore holds L + D = 104 samples, of which the first D samples are the last D samples of the previous frame, i.e.

d(m, n) = d(m-1, L+n), 0 \leq n < D

Here m denotes the index of the current frame.

Pre-emphasis is applied to the input speech s(n), giving

d(m, D+n) = s(n) + \xi_p s(n-1), 0 \leq n < L

where ξ_p = -0.8 is the pre-emphasis factor.
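The two buffer equations above can be sketched as follows. This is only an illustrative sketch: the function name `build_frame`, its arguments, and the choice of s(-1) = 0 for the very first frame are assumptions that do not appear in the patent text.

```python
L, D = 80, 24      # frame length and overlap at 8 kHz, as stated in the text
XI_P = -0.8        # pre-emphasis factor from the text


def build_frame(prev_buf, samples, prev_sample=0.0):
    """Return the (L + D)-sample buffer d(m, .) for the current frame.

    prev_buf    -- buffer d(m-1, .) of the previous frame, length L + D
    samples     -- the L new input samples s(0) .. s(L-1)
    prev_sample -- s(-1), the last raw sample of the previous frame
                   (assumed to be 0 for the first frame)
    """
    if len(prev_buf) != L + D or len(samples) != L:
        raise ValueError("unexpected buffer or frame length")
    buf = [0.0] * (L + D)
    # d(m, n) = d(m-1, L+n), 0 <= n < D: carry over the tail of the previous frame
    buf[:D] = prev_buf[L:L + D]
    # d(m, D+n) = s(n) + xi_p * s(n-1), 0 <= n < L: pre-emphasis
    last = prev_sample
    for n, s in enumerate(samples):
        buf[D + n] = s + XI_P * last
        last = s
    return buf
```

For 16 kHz sampling the same sketch would apply with L = 160 and D = 48.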

The pre-emphasized input data d(m, n) are windowed with a smoothed trapezoidal window and then zero-padded to form the M = 128 point discrete Fourier transform (DFT) input data g(n), that is:

The discrete Fourier transform is applied to g(n) to obtain the spectrum G(k) of the input signal:

G(k) = \frac{2}{M} \sum_{n=0}^{M-1} g(n) e^{-j 2 \pi n k / M}, 0 \leq k < M

In practice, since g(n) is real, the M-point real FFT can be computed quickly using an M/2-point complex FFT.
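The scaled transform can be illustrated with a direct DFT. The sketch below assumes the input has already been windowed (the window formula is not reproduced in this text, so no window is applied here), and it uses a plain O(M²) sum for clarity rather than the M/2-point complex FFT mentioned above. The function name `spectrum` is hypothetical.

```python
import cmath

M = 128  # DFT size for 8 kHz sampling, as stated in the text


def spectrum(windowed):
    """Zero-pad `windowed` to M points and return G(k) for 0 <= k < M.

    Implements G(k) = (2/M) * sum_n g(n) * exp(-j*2*pi*n*k/M) directly.
    """
    if len(windowed) > M:
        raise ValueError("frame is longer than the DFT size")
    g = list(windowed) + [0.0] * (M - len(windowed))  # zero padding
    return [
        (2.0 / M) * sum(g[n] * cmath.exp(-2j * cmath.pi * n * k / M)
                        for n in range(M))
        for k in range(M)
    ]
```

A production implementation would normally replace the inner sum with an FFT routine; the 2/M scaling is the only part specific to the formula above.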

For 16 kHz sampling, 160 samples form one frame (L = 160), and inter-frame overlap is used with D = 48 overlapping samples. The input frame buffer d(m, n) then holds L + D = 208 samples, and a 256-point FFT is performed.

In the second step, the spectrum is divided non-linearly into subbands, the energy and the foreground SNR of each subband are calculated, and the foreground SNR metric is calculated from the foreground SNR:

(1) The energy E_ch(m, i) of each subband of the current frame is calculated by the following formula:

E_{ch}(m, i) = \max\{E_{\min}, \alpha_{ch}(m) E_{ch}(m-1, i) + (1 - \alpha_{ch}(m)) \frac{1}{f_H(i) - f_L(i) + 1} \sum_{k=f_L(i)}^{f_H(i)} |G(k)|^2\}, 0 \leq i < N_C

where N_C = 16 is the number of subbands, E_min = 0.0625 is the minimum subband energy, and α_ch(m) is the subband energy smoothing factor. The smoothing factor α_ch(m) is defined as

f_L(i) and f_H(i) are the start and end positions of the i-th subband, where f_L and f_H are defined as follows:

f_L = {2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56},

f_H = {3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63}

For 16 kHz sampling:

f_L = {2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 57, 66, 77, 90, 106},

f_H = {3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 56, 65, 76, 89, 105, 127}
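The subband energy formula can be sketched with the 8 kHz band edges as follows. Because the definition of α_ch(m) is not reproduced in this text, the sketch takes it as a caller-supplied parameter `alpha_ch`; that parameter and the function name `subband_energies` are assumptions made for illustration.

```python
# 8 kHz band edges (inclusive DFT bin indices) from the text
F_L = [2, 4, 6, 8, 10, 12, 14, 17, 20, 23, 27, 31, 36, 42, 49, 56]
F_H = [3, 5, 7, 9, 11, 13, 16, 19, 22, 26, 30, 35, 41, 48, 55, 63]
E_MIN = 0.0625  # minimum subband energy from the text


def subband_energies(G, prev_energy, alpha_ch):
    """Return E_ch(m, i) for all subbands of the current frame.

    G           -- spectrum G(k), indexable up to at least bin 63
    prev_energy -- E_ch(m-1, i) for each subband
    alpha_ch    -- smoothing factor alpha_ch(m); its definition is not
                   given in the text, so the caller must supply it
    """
    energies = []
    for i, (lo, hi) in enumerate(zip(F_L, F_H)):
        # mean of |G(k)|^2 over the bins f_L(i) .. f_H(i), inclusive
        mean_power = sum(abs(G[k]) ** 2 for k in range(lo, hi + 1)) / (hi - lo + 1)
        smoothed = alpha_ch * prev_energy[i] + (1.0 - alpha_ch) * mean_power
        energies.append(max(E_MIN, smoothed))
    return energies
```

The band widths grow from 2 bins at low frequencies to 8 bins at the top, which is what makes the division non-linear.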

(2) Subband SNR estimation

The SNR σ_q(i) of each subband is calculated as follows:

\sigma_q(i) = \max\{0, \min\{89, \mathrm{round}\{10 \log_{10}(\frac{E_{ch}(m, i)}{E_n(m, i)}) / 0.375\}\}\}, 0 \leq i < N_C

where E_n(m, i) is the estimated noise energy of the i-th subband of the current frame and 0.375 is the SNR quantization step. σ_q(i) is quantized to an integer and limited to the range 0 to 89.
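As a small worked illustration of the quantization, the formula can be written as below. The function name `quantized_snr` is hypothetical, and both energies are assumed to be positive (which the minimum-energy floors guarantee).

```python
import math

SNR_STEP_DB = 0.375  # quantization step from the text


def quantized_snr(e_ch, e_n):
    """Return sigma_q for one subband: SNR in dB divided by the step,
    rounded to an integer and clamped to 0..89.

    Note: Python's round() rounds exact ties to the nearest even integer;
    the text does not say how ties should be handled.
    """
    snr_db = 10.0 * math.log10(e_ch / e_n)
    return max(0, min(89, round(snr_db / SNR_STEP_DB)))
```

With this step size the upper limit of 89 corresponds to roughly 33 dB, so any subband SNR above that saturates.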

(3) Calculating the SNR metric

The SNR metric v(m) describes, on the basis of the subband SNRs, how closely the current frame resembles speech; it is the criterion for deciding whether the current frame is speech or noise:

v(m) = \sum_{i=0}^{N_C - 1} V(\sigma_q(i))

where V(k) is the k-th value in the SNR metric table {V}. {V} has 90 elements and is defined as

V = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50}.
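The table lookup and sum can be sketched as follows; the function name `snr_metric` is an assumption, while the table values are copied from the text.

```python
# SNR metric table {V} from the text (90 entries, indexed by sigma_q = 0..89)
V = [
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5,
    5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 15,
    15, 16, 17, 17, 18, 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 27, 28, 28, 29, 30,
    31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
    50, 50, 50, 50, 50, 50, 50, 50, 50, 50,
]


def snr_metric(sigma_q):
    """Return v(m): the sum of V[sigma_q(i)] over all subbands."""
    return sum(V[s] for s in sigma_q)
```

With 16 subbands the metric ranges from 16 × 2 = 32 (every subband at σ_q = 0) to 16 × 50 = 800, so the floor of 32 sits just below the 35 to 40 range the text gives for threshold one.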

Third step: if this frame is the first frame, the current state is set to the foreground state.

Fourth step: the work of the foreground and the background is controlled according to the various statistics of the SNR metric.

Fifth step: if the current state is the foreground state, the following judgment and processing are carried out:

1) When the foreground SNR metric is below threshold one, the frame is considered noise and the foreground noise energy update is started;

2) If the current state is the foreground state and the foreground SNR metric stays above threshold one for 2 consecutive seconds, the method prepares to enter the transition state: the current subband energies are taken as the background subband noise energy, and the current state is set to the transition state;

3) When the foreground SNR metric stays above threshold two for 2 consecutive seconds, the signal is considered music; the foreground subband noise energy update is disabled and the current state is set to the background state;

4) Go to the eighth step.
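Items 1) to 3) of the fifth step can be sketched as a per-frame handler. This is only one possible reading: the concrete threshold values, the names `VadState` and `foreground_step`, and the choice to evaluate item 3) after item 2) (so that the background state wins when both 2-second conditions are met on the same frame) are all assumptions, not details given in the text.

```python
from dataclasses import dataclass, field

FRAMES_PER_SECOND = 100          # 10 ms frames, from the text
THRESHOLD_1 = 38                 # assumed value inside the stated 35..40 range
THRESHOLD_2 = THRESHOLD_1 + 8    # assumed value inside threshold one + 5..10


@dataclass
class VadState:
    state: str = "foreground"        # "foreground", "transition" or "background"
    above_t1: int = 0                # consecutive frames with metric > threshold one
    above_t2: int = 0                # consecutive frames with metric > threshold two
    fg_update_enabled: bool = True   # whether foreground noise updates are allowed
    bg_noise: list = field(default_factory=list)


def foreground_step(st, metric, subband_energy):
    """Apply items 1)-3) to one frame.

    Returns True when the frame is treated as noise, in which case the
    caller would run the foreground noise energy update.
    """
    is_noise = metric < THRESHOLD_1                       # item 1)
    st.above_t1 = st.above_t1 + 1 if metric > THRESHOLD_1 else 0
    st.above_t2 = st.above_t2 + 1 if metric > THRESHOLD_2 else 0
    if st.above_t1 >= 2 * FRAMES_PER_SECOND:              # item 2)
        st.bg_noise = list(subband_energy)
        st.state = "transition"
    if st.above_t2 >= 2 * FRAMES_PER_SECOND:              # item 3)
        st.fg_update_enabled = False
        st.state = "background"
    return is_noise
```

Counting consecutive frames rather than measuring wall-clock time keeps the handler deterministic; 2 seconds corresponds to 200 frames at the 10 ms frame length.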

Sixth step: if the current state is the background state, the background subband noise energy update is started, the background SNR and the background SNR metric are calculated, and the following judgment and processing are carried out:

1) Calculate the background SNR and the background SNR metric;

2) If the background SNR metric stays above threshold one for 6 consecutive seconds, take the current subband energies as the background subband noise energy;

3) When the statistic of the background SNR metric over a period of time satisfies the specified conditions, or the foreground SNR metric is found to stay below threshold two for 1 consecutive second, take the background subband noise energy as the foreground subband noise energy, set the current state to the foreground state, stop the background process, restart the foreground noise energy update process, and go to the eighth step;

4) When the background SNR metric is below threshold one, start the background noise energy update;

5) Go to the eighth step.

Seventh step: if the current state is the transition state, the following judgment is carried out:

1) When the foreground SNR metric stays above threshold two for 2 consecutive seconds, the signal is considered music and the current state is set to the background state;

2) Calculate the background SNR and the background SNR metric;

3) If the background SNR metric stays above threshold one for 6 consecutive seconds, take the current subband energies as the background subband noise energy;

4) When the statistic of the background SNR metric over a period of time satisfies the specified conditions, or the foreground SNR metric is found to stay below threshold two for 1 consecutive second, take the background subband noise energy as the foreground subband noise energy, set the current state to the foreground state, and go to step 6);

5) When the background SNR metric is below threshold one, start the background noise update;

6) When the foreground SNR metric is below threshold one, the frame is considered noise and the foreground noise update is started;

7) Go to the eighth step.

Eighth step: according to the requirement of the external module, output either the foreground SNR metric or the silence flag derived from the foreground SNR metric as the VAD control flag.

Ninth step: calculate and output the total energy of the subbands of the current frame according to the requirement of the external module (this step is optional).

Tenth step: return to the first step and process the next frame of data.

In the described voice activity detection method, threshold one ranges from 35 to 40, and threshold two ranges from threshold one plus five to threshold one plus ten.

In the eighth step, if the foreground SNR metric is greater than threshold one, the sound flag is set; otherwise the silence flag is set.

In the sixth and seventh steps, the statistic of the background SNR metric is computed as follows: 20 subframes form one multiframe (200 ms); for each subframe, if its background SNR metric is greater than threshold one, the statistic is decremented by 1; otherwise the statistic is incremented by 1.

The specified conditions are satisfied in either of the following cases:

1. The statistic is greater than zero for 30 consecutive multiframes;

2. The ratio of the number of multiframes whose statistic is greater than zero to the number whose statistic is less than zero exceeds 35 to 7.

Both conditions indicate that the current sound has pronounced noise characteristics.
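The multiframe statistic and the two conditions can be sketched as follows. The text does not say over how many multiframes the 35 to 7 ratio is evaluated; the sketch assumes a window of 42 multiframes (35 + 7), and that window, like the function names, is purely an illustrative assumption.

```python
SUBFRAMES_PER_MULTIFRAME = 20   # 20 x 10 ms = 200 ms, from the text
RUN_LENGTH = 30                 # condition 1: consecutive multiframes
RATIO = (35, 7)                 # condition 2: positive : negative multiframes
RATIO_WINDOW = 42               # ASSUMED window for condition 2 (not in the text)


def multiframe_statistic(bg_metrics, threshold_one):
    """Return the statistic of one multiframe: -1 for each subframe whose
    background SNR metric exceeds threshold one, +1 otherwise."""
    if len(bg_metrics) != SUBFRAMES_PER_MULTIFRAME:
        raise ValueError("a multiframe must contain exactly 20 subframes")
    return sum(-1 if v > threshold_one else 1 for v in bg_metrics)


def noise_like(stats):
    """Check the specified conditions on a time-ordered list of
    multiframe statistics (most recent last)."""
    # Condition 1: the latest 30 multiframes all have a statistic > 0
    cond1 = len(stats) >= RUN_LENGTH and all(s > 0 for s in stats[-RUN_LENGTH:])
    # Condition 2: positive-to-negative multiframe count ratio exceeds 35:7
    cond2 = False
    if len(stats) >= RATIO_WINDOW:
        window = stats[-RATIO_WINDOW:]
        pos = sum(1 for s in window if s > 0)
        neg = sum(1 for s in window if s < 0)
        cond2 = pos * RATIO[1] > neg * RATIO[0]   # cross-multiplied, avoids division by zero
    return cond1 or cond2
```

A positive statistic means that most subframes in the multiframe stayed at or below threshold one, which is why long runs of positive values indicate noise.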

In the described voice activity detection method, the background noise energy of the next frame is updated by the following formula:

E_n(m+1, i) = \max\{E_{\min}, \alpha_n E_n(m, i) + (1 - \alpha_n) E_{ch}(m, i)\}, 0 \leq i < N_C

where E_min = 0.00625 is the minimum allowed subband energy and α_n = 0.9 is the subband noise energy smoothing factor, which directly affects how quickly the subband noise energy estimate is updated. Usually, the subband energy of each of the first four frames is used as the initial value of the subband noise energy:

E_n(m, i) = \max\{E_{init}, E_{ch}(m, i)\}, 1 \leq m \leq 4, 0 \leq i < N_C

where E_init = 16.
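The initialization and update rules can be sketched as below. The function names are hypothetical, and the value of E_min is taken exactly as printed alongside this formula.

```python
E_MIN = 0.00625   # minimum allowed subband energy, as printed with this formula
ALPHA_N = 0.9     # subband noise energy smoothing factor from the text
E_INIT = 16.0     # initial noise floor from the text


def init_noise(e_ch):
    """Noise estimate for one of the first four frames:
    E_n(m, i) = max(E_init, E_ch(m, i))."""
    return [max(E_INIT, e) for e in e_ch]


def update_noise(e_n, e_ch):
    """Noise estimate for the next frame:
    E_n(m+1, i) = max(E_min, alpha_n * E_n(m, i) + (1 - alpha_n) * E_ch(m, i))."""
    return [max(E_MIN, ALPHA_N * n + (1.0 - ALPHA_N) * c)
            for n, c in zip(e_n, e_ch)]
```

With α_n = 0.9 each update moves the estimate only one tenth of the way toward the current subband energy, so a single loud frame has limited effect on the noise floor.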

The flow of the invention as applied in a complete system is described below with reference to Fig. 2:

After the compressed bitstream of each voice channel is input, it is decoded, and the decoded signal is analysed and processed with this method. The SNR metric (SNR) and the total subband energy (tce) of each channel are then output to the audio mixing module, which finally selects the top n channels to participate in mixing according to the magnitudes of SNR and tce. Because the computational load of this method is very small, it can be implemented on the same digital signal processing (DSP) chip as the decoder, or on the same DSP chip as the mixing algorithm.
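The channel selection in the mixing module might look like the sketch below. The text only says that the top n channels are chosen according to the magnitudes of SNR and tce; ranking by SNR metric first and using tce to break ties is an assumption made here, as is the function name `select_channels`.

```python
def select_channels(channels, n):
    """Return the ids of up to n channels to include in the mix.

    channels -- list of (channel_id, snr_metric, tce) tuples
    Ranking rule (ASSUMED): higher SNR metric first, then higher tce.
    """
    ranked = sorted(channels, key=lambda c: (c[1], c[2]), reverse=True)
    return [c[0] for c in ranked[:n]]
```

For example, with three channels where two share the same SNR metric, the one with the larger total subband energy is ranked ahead of the other.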

Claims

1. A voice activity detection method for identifying speech and music in a noisy environment, characterized in that it comprises the following steps:

1) the acquired sampled data are first converted to the frequency domain by the fast Fourier transform;

2) in the frequency domain, the spectrum is divided non-linearly into subbands; the energy and the foreground SNR of each subband are then calculated, and the foreground SNR metric is calculated from the foreground SNR;

3) if the current frame is the first frame, the current state is set to the foreground state;

4) the work of the foreground and the background is controlled according to the various statistics of the current SNR metric;

5) if the current state is the foreground state: when the foreground SNR metric is below threshold one, the frame is considered noise and the foreground noise energy update is started; if the foreground SNR metric stays above threshold one for 2 consecutive seconds, the current state is set to the transition state; when the foreground SNR metric stays above threshold two for 2 consecutive seconds, the signal is considered music, the foreground subband noise energy update is disabled, and the current state is set to the background state;

6) if the current state is the background state, the background subband noise energy update is started and the background SNR and the background SNR metric are calculated; if the background SNR metric stays above threshold one for 6 consecutive seconds, the current subband energies are taken as the background subband noise energy; when the statistic of the background SNR metric over a period of time satisfies the specified conditions, or the foreground SNR metric is found to stay below threshold two for 1 consecutive second, the background subband noise energy is taken as the foreground subband noise energy, the current state is set to the foreground state, the background process is stopped, the foreground noise energy update process is restarted, and the method goes to step 8); when the background SNR metric is below threshold one, the background noise energy update is started;

7) if the current state is the transition state, transition-state processing is entered, and a further judgment based on the statistics of the SNR metric determines whether the foreground state or the background state is finally entered:

(1) when the foreground SNR metric stays above threshold two for 2 consecutive seconds, the signal is considered music and the current state is set to the background state;

(2) the background SNR and the background SNR metric are calculated;

(3) if the background SNR metric stays above threshold one for 6 consecutive seconds, the current subband energies are taken as the background subband noise energy;

(4) when the statistic of the background SNR metric over a period of time satisfies the specified conditions, or the foreground SNR metric is found to stay below threshold two for 1 consecutive second, the background subband noise energy is taken as the foreground subband noise energy, the current state is set to the foreground state, and the method goes to step (6);

(5) when the background SNR metric is below threshold one, the background noise update is started;

(6) when the foreground SNR metric is below threshold one, the frame is considered noise and the foreground noise update is started;

(7) go to step 8);

8) according to the requirement of the external module, either the foreground SNR metric is output, or the silence flag derived from the foreground SNR metric is output as the voice activity detection control flag;

9) return to step 1) and process the next frame.

2. The voice activity detection method according to claim 1, characterized in that the following may be added between steps 8) and 9): calculating and outputting the total energy of the subbands of the current frame according to the requirement of the external module.

3. The voice activity detection method according to claim 1, characterized in that, in steps 6) and 7), the statistic of the background SNR metric is calculated as follows: 20 subframes form one multiframe; for each subframe, if its background SNR metric is greater than threshold one, the statistic is decremented by 1; otherwise the statistic is incremented by 1.

4. The voice activity detection method according to claim 1, characterized in that, in step 8), if the foreground SNR metric is greater than threshold one, the sound flag is set; otherwise the silence flag is set.

5. The voice activity detection method according to any one of claims 1 to 4, characterized in that threshold one ranges from 35 to 40.

6. The voice activity detection method according to any one of claims 1 to 4, characterized in that threshold two lies between the value of threshold one plus five and the value of threshold one plus ten.