CN100580770C

CN100580770C - Voice end detection method based on energy and harmonic

Info

Publication number: CN100580770C
Application number: CN200510089957A
Authority: CN
Inventors: 国雁萌; 付强
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2005-08-08
Filing date: 2005-08-08
Publication date: 2010-01-13
Anticipated expiration: 2025-08-08
Also published as: CN1912993A

Abstract

The invention relates to a speech endpoint detection method based on energy and harmonics, comprising the following steps: the digitized sound signal is preprocessed and stored in a buffer; each time a new frame is input, the buffer is shifted and updated and the threshold is adjusted; a preliminary speech start point T_start is determined from the signal energy; the buffer is then searched for signals with the harmonic characteristics of voiced sound; if voiced sound is found, the speech start point is searched for precisely, otherwise the energy-based search for the start point continues; finally, the speech end point is searched for according to the signal energy. Advantages: the method adjusts its accuracy according to the noise intensity and therefore adapts to the signal-to-noise ratio of the input signal, and the energy detection range is wide, so that speech with weak energy is not missed.

Description

Speech endpoint detection method based on energy and harmonics

Technical field

The present invention relates to the field of automatic speech recognition, and in particular to a speech endpoint detection method.

Background art

The input signal of an automatic speech recognition system is usually speech mixed with noise. To keep signal segments that contain no speech out of the recognizer, and thereby safeguard system performance and reduce computational cost, the start point and end point of the user's speech must be detected in the signal. This process is called endpoint detection.

Common endpoint detection algorithms fall into two classes: rule-based and model-based. Rule-based methods generally compute a long-term distance from features such as signal energy, zero-crossing rate, cepstrum, and spectral estimates, and decide whether speech is present by comparing the distance with a threshold and applying logical operations. Model-based methods generally build separate models of the statistical properties of noise and of speech, and then decide according to the likelihood.

Automatic speech recognition systems are used in complex environments, so endpoint detection must be broadly and reliably adaptable. It must adapt to slowly varying noise of various intensities and types, adapt quickly to changes in noise intensity and type, remain unaffected by brief strong noise, and keep its accuracy, computational efficiency, and latency stable in all cases. In a complex noise environment, however, not only does the noise itself have no fixed characteristics, but the speech features are often masked by the noise. Model-based methods are often suitable only for specific environments, while rule-based methods built on a single feature are both easily limited by the signal-to-noise ratio and rather sensitive to changes in noise type. Many researchers therefore improve the reliability of endpoint detection by combining several features. One example is the method proposed by Sahar E. Bou-Ghazale et al. in 2002, which performs endpoint detection using the additive relation of energies together with cepstral features. This method works stably in stationary noise, but it is very sensitive to abrupt changes in noise energy, and because the cepstral features of speech are affected by noise, it cannot be used at low signal-to-noise ratios. Another example is the endpoint detection method proposed by Arnaud Martin in 2003, which combines signal energy with voiced-sound features. Its shortcoming is that it has difficulty tracking abrupt changes in noise energy. In addition, it must first detect the fundamental frequency of the signal before it can identify voiced sound, so it is easily disturbed by multiples of the pitch frequency. The algorithm also uses a comb filter in the frequency domain, which makes it difficult to find voiced sound in complex noise.

Summary of the invention

The purpose of the invention is to provide an endpoint detection method that combines signal energy with the voiced-harmonic characteristics of speech. The method is suitable for most speech and noise environments: it works stably under many noise types and intensities, and it adapts quickly to sudden changes in noise type and intensity.

To achieve this purpose, the speech endpoint detection method based on energy and harmonics provided by the invention comprises the following steps:

1) The input digitized sound signal is divided into frames and preprocessed (windowing, pre-emphasis, and the like); each frame contains F samples and is about 25 ms long, with adjacent frames overlapping by about 15 ms;

2) A buffer is used to hold the first L frames of the signal (L > 25; the L frames span more than 200 ms);

3) The speech start point is preliminarily detected from the energy of the signal in the buffer, comprising:

31) The noise energy is estimated from the first r frames in the buffer, where 7 < r < 13. With E_i denoting the energy of the i-th frame, compute the mean energy of the first r frames

\bar{E} = \frac{1}{r} \sum_{i=1}^{r} E_i

and the standard deviation

\sigma = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E})^2 \right)^{1/2},

then normalize the standard deviation, d = σ/Ē. Ē and d serve as the initial estimate of the noise distribution;

32) The energy threshold Thd is set from the noise energy as Thd = Ē + σ/α = Ē + Ē·d/α, where α is a sensitivity coefficient, 0 < α < 1;

33) Once Thd is determined, energy detection is performed. If the buffer contains a continuous signal segment of t_start_energy milliseconds whose mean energy exceeds Thd, the first frame of that segment is marked as T_start and harmonic detection is triggered; t_start_energy is a constant in milliseconds with 80 < t_start_energy < 150. If no segment in the buffer satisfies this condition, a new frame is input, the buffer and the threshold Thd are updated, and the speech start point is searched for in the updated buffer. Each time a new frame is input, the buffer is updated once and Thd is updated with it, until T_start is found. The buffer is updated as follows: the content of frame 1 is discarded, the contents of frames 2 to L are each moved forward by one frame, and the new frame is stored as frame L;

4) After energy detection has preliminarily determined the speech start point T_start, the current buffer is searched for a signal with the harmonic characteristics of voiced sound. If voiced sound is found, the buffer is considered to contain speech and step 5) is entered to refine the start point; if not, the T_start mark is deleted and the method returns to step 3);

5) The speech start point is searched for precisely in the current buffer: starting from T_start and moving frame by frame toward the front of the buffer, a frame is sought from which the signal energy increases frame by frame; the frame immediately preceding it is judged to be the speech start point.
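Steps 31) to 33) above can be sketched for a single snapshot of the buffer as follows. This is only an illustrative sketch: the function names, the sample energies, the choice α = 0.5, and the conversion of t_start_energy into 10 frames (assuming a 10 ms frame shift, i.e. 25 ms frames overlapping by 15 ms) are assumptions, not values fixed by the text. The per-frame buffer and threshold updates are not shown.

```python
import math

def noise_stats(energies, r):
    """Mean, standard deviation and normalized deviation d of the first r frame energies."""
    head = energies[:r]
    mean = sum(head) / r
    sigma = math.sqrt(sum((e - mean) ** 2 for e in head) / r)
    return mean, sigma, sigma / mean

def start_threshold(mean, d, alpha):
    """Thd = E + sigma/alpha = E + E*d/alpha, with 0 < alpha < 1."""
    return mean + mean * d / alpha

def coarse_start(energies, thd, n_frames):
    """First frame of the first n_frames-long segment whose mean energy exceeds thd, else None."""
    for i in range(len(energies) - n_frames + 1):
        if sum(energies[i:i + n_frames]) / n_frames > thd:
            return i
    return None

# Hypothetical buffer of L = 30 frame energies: 18 noise-like frames, then a louder segment.
buffer_energies = ([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 10.0]
                   + [10.0] * 8 + [14.0] * 12)
mean, sigma, d = noise_stats(buffer_energies, r=10)        # r chosen inside 7 < r < 13
thd = start_threshold(mean, d, alpha=0.5)                  # alpha is an assumed value
t_start = coarse_start(buffer_energies, thd, n_frames=10)  # about 100 ms, assumed frame shift
```

Because the criterion is applied to the mean energy of a whole segment, the marked frame is the first frame of the qualifying segment rather than the frame where the energy rises.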

In the above technical solution, after step 5) is completed, the speech end point is detected from the energy of the signal in the buffer. The detection proceeds as follows: if the buffer contains a continuous signal segment of length t_end_energy milliseconds (t_end_energy is a constant in milliseconds, 60 < t_end_energy < 120) in which the energy of every frame exceeds the end-point energy threshold Thd_end, the buffer is updated. The buffer and Thd_end are updated repeatedly until no such segment can be found in the buffer, at which point the first frame in the buffer is judged to be the speech end point.

In the above technical solution, the energy threshold in steps 3) and 4) is updated as follows: compute the mean energy of the first r frames in the buffer

\bar{E}_{new} = \frac{1}{r} \sum_{i=1}^{r} E_i

and the normalized standard deviation

d_{new} = \sigma_{new} / \bar{E}_{new} = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E}_{new})^2 \right)^{1/2} / \bar{E}_{new},

then update Ē and d according to the following logic:

a) If Ē_new < c_update·Ē and d_new < c_update·d, then Ē = Ē_new and d = d_new, where c_update is the noise update constant, 1 < c_update < 1.5. This rule ensures that the system tracks stationary noise at all times;

b) If condition a) is not satisfied but Ē_new < 10·Ē and d_new < d, then Ē = Ē_new and d = d_new. This rule helps the system adapt quickly to noise with a step change;

c) If neither a) nor b) is satisfied but Ē_new < 10·Ē, d_new < 1 and d_new < 1.5·d, then Ē = (Ē_new + Ē)/2 and d = (d_new + d)/2. This rule prevents the "deadlock" that commonly arises in noise updating and helps track step noise whose energy rises sharply;

d) If none of the above conditions is satisfied, the r frames are not stationary noise, so no noise update is performed.

After Ē and d are updated, the new threshold is computed as Thd = Ē + Ē·d/α.
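The four update rules can be written as one small function. The sketch below takes the factors 10 and 1.5 as multipliers of the current estimates, which is an interpretation of the conditions above; the function names and the default c_update = 1.2 are illustrative assumptions.

```python
def update_noise(mean, d, mean_new, d_new, c_update=1.2):
    """Return the updated (mean, d) noise estimate following rules a) to d).

    c_update is an assumed value inside the stated range 1 < c_update < 1.5.
    """
    # a) stationary noise similar to the current estimate: take the new values
    if mean_new < c_update * mean and d_new < c_update * d:
        return mean_new, d_new
    # b) energy changed by a step but the signal is at least as steady: take the new values
    if mean_new < 10 * mean and d_new < d:
        return mean_new, d_new
    # c) moderately less steady: average old and new to avoid an update deadlock
    if mean_new < 10 * mean and d_new < 1 and d_new < 1.5 * d:
        return (mean_new + mean) / 2, (d_new + d) / 2
    # d) not stationary noise: keep the old estimate
    return mean, d

def start_threshold(mean, d, alpha):
    """Thd = E + E*d/alpha."""
    return mean + mean * d / alpha

# Hypothetical example: a step increase in noise energy that stays steady (rule b).
mean, d = update_noise(10.0, 0.02, 30.0, 0.015)
thd = start_threshold(mean, d, alpha=0.5)
```

The order of the tests matters: each rule is only reached when the previous ones have failed, which mirrors the "if a) is not satisfied" wording.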

In the above technical solution, the end-point energy threshold Thd_end is determined by the noise energy mean Ē and the normalized standard deviation d: Thd_end = Ē + Ē·d. Each time the buffer is updated, the mean energy of the first r frames

\bar{E}_{new} = \frac{1}{r} \sum_{i=1}^{r} E_i

and the normalized standard deviation

d_{new} = \sigma_{new} / \bar{E}_{new} = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E}_{new})^2 \right)^{1/2} / \bar{E}_{new}

are likewise used to update Ē and d. The update logic is: if Ē_new ≤ Ē and d_new ≤ d, then Ē = (Ē_new + Ē)/2 and d = (d + d_new)/2.
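A minimal sketch of the end-point threshold and its update rule follows; the function names and the numbers in the example are hypothetical.

```python
def update_end_noise(mean, d, mean_new, d_new):
    """During speech, move the noise estimate only toward quieter, steadier noise."""
    if mean_new <= mean and d_new <= d:
        return (mean_new + mean) / 2, (d + d_new) / 2
    return mean, d  # otherwise keep the current estimate

def end_threshold(mean, d):
    """Thd_end = E + E*d."""
    return mean + mean * d

# Hypothetical example: the first r frames became quieter and steadier.
mean, d = update_end_noise(10.0, 0.02, 8.0, 0.01)
thd_end = end_threshold(mean, d)
```

Unlike the start-point rules, this update never raises the estimate, so the end-point threshold can only fall while speech is being tracked.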

The advantages of the invention are as follows. First, a sensitivity coefficient is introduced when energy detection is initialized, so that accuracy is adjusted automatically according to the noise intensity. Second, because energy detection precedes voiced-harmonic detection and the two are combined by a logical AND, the energy threshold can be set low: most noise is still filtered out, yet speech is not misjudged as noise, which prevents the low-energy portion at the start of speech from being cut off. In addition, because a buffer is used, the noise estimate is updated with every input frame, and the threshold update strategy tolerates abrupt changes in noise energy, the method adapts quickly to changes in noise and is therefore more robust to noise. The invention adapts to some extent to the level of the system input signal and places no restriction on the speaker's age, sex, or speaking habits. A particular advantage of the invention is its strong adaptability to step changes in channel noise in telecommunication systems.

Brief description of the drawings

Fig. 1 is a flow chart of the principle of the invention;

Fig. 2 shows the time-domain waveform and spectrogram of a speech segment corrupted by noise;

Fig. 3 is a flow chart of harmonic detection;

Fig. 4 is a schematic diagram of the harmonic decision over 4 consecutive frames.

Embodiment

The principle of the invention is as follows:

First, adaptive energy detection preliminarily determines the speech start point T_start. The method then enters the harmonic detection step: if the harmonic structure of voiced sound is found in the buffer, the buffer is judged to contain voiced sound and the speech start point is searched for precisely near T_start; if no harmonic structure is found, the search for the start point by energy detection continues. After the start point is found, the speech end point is searched for by adaptive energy detection.

Because speech and noise are additive in energy, the onset of speech causes the signal energy to rise suddenly when the noise energy varies slowly. Energy detection continuously tracks slowly varying noise, finds regions where the energy rises sharply, and marks them as positions where speech may occur. Burst noise also changes the signal energy, however, and may cause energy detection to misjudge, so the invention uses harmonic detection to make a further decision and remove these errors.

Harmonics are a salient feature of voiced sound; in a spectrogram they appear as evenly spaced bright bands whose spacing equals the fundamental frequency. Speech is a combination of unvoiced and voiced sounds, and the energy and duration of voiced sound far exceed those of unvoiced sound, so any meaningful speech segment must contain fairly long voiced sound. Because the energy of voiced sound is concentrated mainly at the fundamental and its harmonics, noisy voiced sound still retains some clear harmonics even when the noise is strong (as shown in Fig. 2). If no voiced harmonic structure is present anywhere in a fairly long signal segment (> 200 ms), the current signal can be considered to contain no speech.

The invention is further described below with reference to the drawings and a preferred embodiment.

Embodiment:

The speech endpoint detection method based on energy and harmonics provided by the invention comprises three basic steps: energy detection, harmonic detection, and speech end-point detection. The input is a digitized sound signal divided into frames of equal length (each frame contains F samples and is about 25 ms long) that overlap by about 15 ms, and an L-frame buffer is used (L > 25; the L frames span more than 200 ms). The workflow of each step is described below.

As shown in Fig. 1, the invention comprises the following steps:

Step 100: Set up the L-frame buffer, store the first L frames of the input signal in it, and begin endpoint detection. Each time a frame is input, the buffer is shifted and updated automatically.

Step 200: Preliminarily detect the speech start point from energy. That is, a signal energy threshold Thd is set according to the noise conditions, and the energy of the signal in the buffer is used to judge whether the current buffer contains speech. If so, the speech start point T_start is preliminarily determined and step 300 is entered; if not, step 700 is entered. This step is implemented as follows:

Each of the L frames of digitized speech in the buffer is preprocessed (depending on the actual system, this may include windowing, pre-emphasis, and so on). With a frame length of F samples, each frame is first zero-padded to N points (N ≥ F, N = 2^x, x an integer with x ≥ 8), and an N-point discrete Fourier transform is applied to obtain the discrete spectrum

X(i, bin) = \sum_{n=0}^{N-1} x(i, n) e^{-j (2\pi / N) n \cdot bin},

where x(i, n) is the n-th sample of the i-th frame in the buffer and X(i, bin) is the Fourier transform value of the i-th frame at frequency bin bin (bin = 0, 1, ..., N-1). Passing the spectrum through a filter bank uniformly distributed on the Mel scale (for example p triangular filters, with adjacent filters overlapping by half) gives the energy of each sub-band

E_i(j) = \ln \left( \sum_{bin=L_j}^{H_j} T_j(bin) |X(i, bin)|^2 \right), \quad j = 0, 1, 2, ..., p-1,

where j is the index of the sub-band filter, p is the number of sub-band filters, T_j(bin) is the frequency response of the j-th filter, and L_j and H_j are the start frequency and cutoff frequency of the j-th filter. Finally, summing the sub-band energies gives the energy of the i-th frame

E_i = \sum_{j=0}^{p-1} E_i(j).
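The frame-energy computation above can be sketched in plain Python as follows. Several details are assumptions made only for illustration: the Mel mapping 2595·log10(1 + f/700), the placement of the triangular filter edges, the small floor that guards against ln(0), and the example settings (8 kHz sampling, F = 200, N = 256, p = 8). A naive DFT is used to keep the sketch self-contained; a real implementation would use an FFT.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)   # assumed Mel convention

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame, n_fft):
    """|X(i, bin)|^2 for bin = 0..n_fft/2 after zero-padding the frame to n_fft points."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spec = []
    for b in range(n_fft // 2 + 1):
        x = sum(s * cmath.exp(-2j * math.pi * n * b / n_fft) for n, s in enumerate(padded))
        spec.append(abs(x) ** 2)
    return spec

def mel_filterbank(p, n_fft, sample_rate):
    """p triangular filters evenly spaced on the Mel scale, adjacent filters half overlapping."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * k / (p + 1)) * n_fft / sample_rate for k in range(p + 2)]
    bank = []
    for j in range(p):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]   # edges in fractional bin units
        weights = []
        for b in range(n_fft // 2 + 1):
            if lo < b <= mid:
                weights.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                weights.append((hi - b) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def frame_energy(frame, bank, n_fft, floor=1e-12):
    """Return (E_i, [E_i(j)]): the sum of log sub-band energies and the sub-band list."""
    spec = power_spectrum(frame, n_fft)
    sub = [math.log(max(sum(w * s for w, s in zip(weights, spec)), floor)) for weights in bank]
    return sum(sub), sub

# Hypothetical frames: a 500 Hz tone and the same tone at one tenth of the amplitude.
bank = mel_filterbank(p=8, n_fft=256, sample_rate=8000)
loud = [math.sin(2 * math.pi * 500 * n / 8000) for n in range(200)]
quiet = [0.1 * s for s in loud]
e_loud, sub_loud = frame_energy(loud, bank, 256)
e_quiet, sub_quiet = frame_energy(quiet, bank, 256)
```

Since E_i is a sum of logarithms, scaling the signal shifts every sub-band term by the same amount, so the louder frame always has the larger E_i.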

Energy detection is initialized from the first r frames of the input signal (generally 7 < r < 13). Because the user always switches the system on before starting to speak, these r frames can be assumed to contain only noise. Compute their mean energy

\bar{E} = \frac{1}{r} \sum_{i=1}^{r} E_i

and standard deviation

\sigma = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E})^2 \right)^{1/2},

and normalize the standard deviation, d = σ/Ē. Ē and d then serve as the initial estimate of the noise distribution.

The energy threshold is computed as Thd = Ē + σ/α = Ē + Ē·d/α, where α is the sensitivity coefficient (0 < α < 1). The larger α is, the more sensitive energy detection becomes, that is, the more readily noise is falsely detected as speech; conversely, the smaller α is, the more readily speech is misjudged as noise. α is determined as follows: if Ē ≥ E_x, then α = α_max; if Ē < E_x, then α = α_min, where α_min < α_max. α_min, α_max and E_x can all be preset according to the energy range of the recognizer's input signal, independently of the operating environment. In this way, detection sensitivity is raised automatically when the ambient noise is strong, to avoid missing speech, and lowered when the ambient noise is weak, to avoid excessive false alarms. In addition, a permitted interval for Thd is set according to the input range of the speech recognizer, and the value of Thd is confined to it.
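The choice of α and the limiting of Thd can be sketched as follows. All numeric values and the names thd_low and thd_high are hypothetical; the text only says that these constants are preset from the recognizer's input range.

```python
def choose_alpha(noise_mean, e_x, alpha_min, alpha_max):
    """Use the more sensitive alpha_max in strong noise, alpha_min otherwise."""
    return alpha_max if noise_mean >= e_x else alpha_min

def clamped_threshold(noise_mean, d, alpha, thd_low, thd_high):
    """Thd = E + E*d/alpha, confined to the preset interval [thd_low, thd_high]."""
    thd = noise_mean + noise_mean * d / alpha
    return min(max(thd, thd_low), thd_high)

# Hypothetical presets.
alpha = choose_alpha(noise_mean=12.0, e_x=10.0, alpha_min=0.3, alpha_max=0.8)
thd = clamped_threshold(12.0, 0.02, alpha, thd_low=10.5, thd_high=20.0)
```

A larger α divides the deviation term by a larger number, so the threshold sits closer to the noise mean, which is what makes detection more sensitive.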

After the energy threshold Thd is determined, energy detection is performed as follows. If the current buffer contains a continuous signal segment of t_start_energy milliseconds (t_start_energy is a constant, 80 < t_start_energy < 150) whose mean energy exceeds Thd, its first frame is marked as T_start and harmonic detection is triggered (step 300). Otherwise, if no segment in the buffer satisfies this condition, the content of frame 1 is discarded, the contents of frames 2 to L are each moved forward by one frame, the new frame is stored as frame L, and the energy threshold is updated from the first r frames of the updated buffer (step 700). The buffer and threshold are updated each time a new frame is input, until T_start is found.

Step 300: Perform harmonic detection on the signal in the buffer to judge whether it contains voiced sound. If so, enter step 400; if not, enter step 700. The implementation of this step is described in detail below.

Step 400: Search precisely for the speech start point in the buffer. The main operation is: starting from T_start and moving frame by frame toward the front of the buffer, seek a frame from which the signal energy increases frame by frame; the frame immediately preceding it is judged to be the speech start point.
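One possible reading of step 400 is sketched below. It assumes that the search moves from T_start toward earlier frames and stops at the earliest frame from which the energy rises monotonically up to T_start; the text does not spell out the direction or the stopping rule in more detail, so both are assumptions, as are the function name and the sample energies.

```python
def refine_start(energies, t_start):
    """Walk from t_start toward the front of the buffer while energy keeps rising frame by frame.

    Returns the index of the frame just before the rising run (assumed interpretation).
    """
    k = t_start
    while k > 0 and energies[k - 1] < energies[k]:
        k -= 1                 # frame k-1 is still part of the rising run
    return max(k - 1, 0)       # the frame preceding the first rising frame

# Hypothetical frame energies with a gradual onset before the coarse mark at index 6.
energies = [5.0, 5.1, 5.0, 5.2, 5.6, 6.3, 7.5, 8.0]
start = refine_start(energies, t_start=6)
```

In this example the rising run begins at index 2, so the refined start point falls on index 1, ahead of the coarse mark, which keeps the weak onset of the speech inside the detected segment.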

Step 700: Update the threshold according to the detection result and the current buffer. Threshold updating is the main means by which the invention tracks noise and adapts to environmental change. Specifically, each time a frame is input, the buffer is shifted by one frame. Compute the mean energy of the first r frames in the buffer

\bar{E}_{new} = \frac{1}{r} \sum_{i=1}^{r} E_i

and the normalized standard deviation

d_{new} = \sigma_{new} / \bar{E}_{new} = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E}_{new})^2 \right)^{1/2} / \bar{E}_{new},

and update Ē and d according to the following cases:

1. If the first r frames of the current buffer are stationary noise with characteristics similar to the previous noise, the mean and normalized standard deviation change little. The corresponding rule is: if Ē_new < c_update·Ē and d_new < c_update·d, then Ē = Ē_new and d = d_new, where c_update is the noise update constant, 1 < c_update < 1.5.

2. If the above condition is not satisfied but Ē_new < 10·Ē and d_new < d, then Ē = Ē_new and d = d_new. In this case the r frames are probably stationary noise that differs somewhat from the previous noise, so Ē_new differs from Ē, but d_new is still small and not noticeably larger than d. For step noise whose energy rises suddenly and stays high, this rule also helps the system track its energy quickly.

3. If neither condition 1 nor condition 2 is satisfied but Ē_new < 10·Ē, d_new < 1 and d_new < 1.5·d, then Ē = (Ē_new + Ē)/2 and d = (d_new + d)/2. This rule prevents the "deadlock" that commonly arises in noise updating and likewise helps track step noise quickly.

4. In all other cases the current r frames contain non-stationary noise and no noise update is performed. If the r frames happen to contain the switching point between two stationary noise segments whose energies differ considerably, then as the buffer is progressively updated the first r frames gradually come to contain only the new stationary noise and satisfy case 2 or 3. The system thus tracks stationary noise with a delay of at most r frames.

From the updated mean and normalized standard deviation, the new threshold is computed as Thd = Ē + Ē·d/α, and the method returns to step 200 to perform energy detection on the current buffer.

Step 500: After the speech start point is detected, detect the speech end point from the energy of the signal in the buffer. The detection proceeds as follows: if the buffer contains a continuous signal segment of length t_end_energy milliseconds (t_end_energy is a constant, 60 < t_end_energy < 120) in which the energy of every frame exceeds the end-point energy threshold Thd_end, the buffer and the threshold are updated. The buffer and Thd_end are updated repeatedly until no such segment exists in the buffer, at which point the first frame of the buffer is judged to be the speech end point.
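The end-point search can be sketched as a buffer sliding over a sequence of frame energies. To keep the sketch short, Thd_end is held fixed instead of being updated with every shift, and t_end_energy is converted to 8 frames; both simplifications, the function names, and the sample data are assumptions.

```python
def has_run_above(energies, thd_end, n_frames):
    """True if some n_frames consecutive frames all have energy above thd_end."""
    run = 0
    for e in energies:
        run = run + 1 if e > thd_end else 0
        if run >= n_frames:
            return True
    return False

def find_end(frame_energies, first, buf_len, thd_end, n_frames):
    """Slide a buf_len-frame buffer forward from `first`, one frame per step.

    Returns the index of the buffer's first frame once the buffer no longer
    contains a qualifying run, or None if the data ends before that happens.
    """
    pos = first
    while pos + buf_len <= len(frame_energies):
        window = frame_energies[pos:pos + buf_len]
        if not has_run_above(window, thd_end, n_frames):
            return pos
        pos += 1
    return None

# Hypothetical data: 40 speech-like frames followed by 40 noise-like frames.
energies = [14.0] * 40 + [10.0] * 40
end = find_end(energies, first=0, buf_len=30, thd_end=10.2, n_frames=8)
```

The returned index is the position of buffer frame 1 at the moment the condition first fails, following the rule stated above.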

Because energy detection requires little computation, judging the speech end point by energy keeps the system efficient. In addition, the algorithm continues to update the noise estimate during speech segments; when the ambient noise weakens, the threshold falls with it, which helps prevent the end point from being declared too early and reduces recognition errors.

The energy threshold Thd_end is determined by the noise energy mean Ē and the normalized standard deviation d: Thd_end = Ē + Ē·d. Each time the buffer moves by one frame, the mean energy

\bar{E}_{new} = \frac{1}{r} \sum_{i=1}^{r} E_i

and the normalized standard deviation

d_{new} = \sigma_{new} / \bar{E}_{new} = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E}_{new})^2 \right)^{1/2} / \bar{E}_{new}

are likewise used to update Ē and d. The update logic is: if Ē_new ≤ Ē and d_new ≤ d, then Ē = (Ē_new + Ē)/2 and d = (d + d_new)/2. In this way the energy threshold is updated only when lower or more stationary noise is detected, which ensures that speech with steady energy is not misjudged as noise. For a speech recognition system, cutting off the low-energy portion at the tail of speech may cause misrecognition or rejection. Noise continues to change while speech is present; if the noise information from the time of start-point detection were still used, the low-energy tail of the speech might be mistaken for noise after the noise has decreased, which would impair the integrity of the speech and lower the recognition rate. Because this update logic keeps detecting new noise within speech segments, it prevents this problem and keeps the speech segment complete.

Step 600: Output the speech start point and end point; the endpoint detection process ends.

The implementation of step 300 in this embodiment is described in detail below.

After energy detection has preliminarily determined the speech start point T_start, this step uses several features of voiced harmonics to search the current buffer for signal frames containing voiced harmonics, locates voiced sound accordingly, and thereby accomplishes start-point detection.

In a practical speech recognition system, the signal-to-noise ratio of the input differs from band to band. If the noise energy near the fundamental frequency is strong, the pitch is hard to detect even though the harmonic character of the signal is clear, so methods that detect voiced sound by detecting the pitch are easily disturbed. Likewise, if speech harmonics are detected over the full band, bands with low signal-to-noise ratio also impair voiced-sound detection. The invention therefore searches only for the 5 clearest harmonics, according to the actual signal, or only for the fundamental and its 3 adjacent harmonics, thereby automatically avoiding bands with low signal-to-noise ratio. It works stably in strong noise and is insensitive to changes in the noise.

Because the harmonics and the fundamental carry most of the energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, pure voiced sound has evenly spaced energy extrema in the frequency domain, with spacing equal to the fundamental frequency. Even when the voiced signal is disturbed by the recording equipment and by noise, at least 4 to 5 equally spaced energy extrema generally remain in the frequency domain; this is the main basis of harmonic detection in this embodiment. As shown in Fig. 3, step 300 of this embodiment (the harmonic detection step) comprises the following sub-steps:

Step 301: Begin examining the L frames of digitized speech in the buffer, searching for signals with the harmonic characteristics of voiced sound.

Step 302: Take from the buffer a frame that has not yet undergone harmonic detection (let it be the i-th frame in the buffer), preprocess it (windowing, pre-emphasis, and so on, depending on the system), zero-pad it to N points (N ≥ F, N = 2^x, x an integer with x ≥ 8), and apply an N-point discrete Fourier transform to obtain the discrete spectrum

X(i, bin) = \sum_{n=0}^{N-1} x(i, n) e^{-j (2\pi / N) n \cdot bin}, \quad bin = 0, 1, ..., N-1,

where x(i, n) is the n-th sample of the i-th frame in the buffer and X(i, bin) is the Fourier transform value of the i-th frame at frequency bin bin. Taking the squared magnitude gives the energy of each frequency band, ε_bin = |X(i, bin)|², where bin is the band index, 1 ≤ bin ≤ N/2, and the width of each band is R = sampling rate / N.

Step 303: Denote the minimum band energy by ε_min. Among the ε_bin with 1 < bin < N/2, if ε_bin simultaneously satisfies ε_bin > ε_(bin-1), ε_bin > ε_(bin+1) and ε_bin > M·ε_min, it is marked as an energy extremum. Such an extremum may correspond to a harmonic or the fundamental, but may also correspond to chance noise or a small fluctuation of the speech spectrum. M is an empirical constant, 5 < M < 20. Suppose u bands satisfy this requirement; their positions (the subscripts bin of ε_bin) are recorded in order of increasing frequency as I_k (k = 1, ..., u). The condition ε_bin > M·ε_min is imposed because, when the signal contains harmonics, its energy is unevenly distributed and ε_min differs greatly from the energy of the harmonic bands; only an ε_bin satisfying this condition can correspond to a harmonic or the fundamental.
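The extremum marking of step 303 can be sketched as follows. The list band_energy is assumed to be indexed by bin from 0 to N/2, and the example values and M = 10 are hypothetical.

```python
def energy_extrema(band_energy, m_factor):
    """Band indices (1 < bin < N/2) that are local maxima and exceed m_factor * eps_min.

    band_energy[bin] holds eps_bin for bin = 0..N/2; eps_min is taken over bins 1..N/2.
    """
    eps_min = min(band_energy[1:])
    peaks = []
    for b in range(2, len(band_energy) - 1):
        e = band_energy[b]
        if e > band_energy[b - 1] and e > band_energy[b + 1] and e > m_factor * eps_min:
            peaks.append(b)     # positions come out in order of increasing frequency
    return peaks

# Hypothetical band energies for N = 32 (bins 0..16).
band_energy = [0.0, 1.0, 1.2, 30.0, 2.0, 1.0, 25.0, 1.5, 1.0,
               4.0, 1.0, 1.1, 20.0, 1.0, 1.0, 1.0, 1.0]
extrema = energy_extrema(band_energy, m_factor=10)   # M chosen inside 5 < M < 20
```

The small local maximum at bin 9 is rejected by the M·ε_min test, which is the purpose of that condition.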

Step 304: For each possible pitch, search for the extremum distribution closest to that pitch, that is, match the extrema against the pitch, in order to determine which voiced sound the current frame may correspond to. The human pitch range is 60 to 450 Hz. If pitch_α denotes the band index corresponding to the pitch, then [60/R] ≤ pitch_α ≤ [450/R], where [·] denotes the largest integer less than the enclosed value and R is the width of each band, i.e. the frequency resolution (see step 302). Because the energy advantage of the fundamental and harmonics is more pronounced at low frequencies, only extrema in the range 60 to 2000 Hz are matched, corresponding to [60/R] ≤ I_k ≤ [2000/R].

The search therefore proceeds as follows: the pitch pitch_α traverses all integers from [60/R] to [450/R], and for each value the energy extrema with indices in the range [60/R] to [2000/R] are searched for points close to the positions of pitch_α and its harmonics. Because voiced sound usually exhibits at least 4 to 5 evenly spaced energy extrema in the frequency domain, if the extrema include 5 points spaced by pitch_α whose band indices F_m satisfy F_m ≈ m·pitch_α, where

m = 2, 3, ..., \left[ \frac{2000}{R \cdot pitch_{\alpha}} \right],

or 4 points spaced by pitch_α whose band indices F_m satisfy F_m ≈ m·pitch_α with m = 1, 2, 3, 4, then the frame may correspond to voiced sound with pitch pitch_α; otherwise the frame cannot correspond to that voiced sound.

To this end, for the pitch pitch_α currently under test, the value I_m′ closest to m·pitch_α is sought among I_k (k = 1, ..., u); I_m′ satisfies |I_m′ − m·pitch_α| ≤ |I_m″ − m·pitch_α| (1 ≤ m′ ≤ u, 1 ≤ m″ ≤ u, m″ ≠ m′). All the I_m′ found in this way, one for each multiple m·pitch_α with

m = 1, 2, ..., \left[ \frac{2000}{R \cdot pitch_{\alpha}} \right],

are denoted as the set {P_1, P_2, P_3, ...}, and an element P_0 = 0 is added to form the set {P_0, P_1, P_2, P_3, ...}.
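The construction of the set {P_0, P_1, P_2, ...} for one candidate pitch can be sketched as follows. The bracket [·] is implemented with floor, and the function names, the band width R = 8000/256 Hz, and the extremum positions are illustrative assumptions.

```python
import math

def pitch_bin_range(bin_width_hz):
    """Band indices covering the human pitch range of 60 to 450 Hz."""
    return math.floor(60 / bin_width_hz), math.floor(450 / bin_width_hz)

def match_harmonics(extrema, pitch_bin, bin_width_hz):
    """For each multiple m * pitch_bin, pick the nearest extremum; prepend P_0 = 0."""
    lo = math.floor(60 / bin_width_hz)
    hi = math.floor(2000 / bin_width_hz)
    usable = [b for b in extrema if lo <= b <= hi]    # only extrema in 60 to 2000 Hz
    matched = [0]                                     # P_0 = 0
    if not usable:
        return matched
    m_max = math.floor(2000 / (bin_width_hz * pitch_bin))
    for m in range(1, m_max + 1):
        target = m * pitch_bin
        matched.append(min(usable, key=lambda b: abs(b - target)))
    return matched

# Hypothetical extremum positions I_k for R = 31.25 Hz, tested against pitch bin 4 (125 Hz).
R = 8000 / 256
low, high = pitch_bin_range(R)
P = match_harmonics([4, 8, 12, 16, 20, 33], pitch_bin=4, bin_width_hz=R)
```

A full search would call match_harmonics for every pitch bin from low to high and pass each resulting set to the spacing test of the next step.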

Step 305: If the set {P_0, P_1, P_2, P_3, ...} contains 5 consecutive elements P_t, P_(t+1), P_(t+2), P_(t+3), P_(t+4), where

t = 0, 1, ..., \left[ \frac{2000}{R \cdot pitch_{\alpha}} - 4 \right],

whose spacings are equal, the extremum distribution is considered to match the pitch. Because the band width is R, the frequency resolution cannot be perfectly precise, so the accurate value of the pitch can be expressed as the mean spacing of these 5 elements.

To evaluate whether the spacings of 5 elements are equal, for any 5 consecutive elements of the set {P₀, P₁, P₂, P₃, …}, compute the spacings D₁ = P_{t+1} − P_t, D₂ = P_{t+2} − P_{t+1}, D₃ = P_{t+3} − P_{t+2}, D₄ = P_{t+4} − P_{t+3}, where

t = 0, 1, \ldots, [\frac{2000}{R \cdot {pitch}_{α}} - 4].

Compute the mean \overline{D} of {D₁, D₂, D₃, D₄} and normalize the variance, expressed as

Var = \frac{\sum_{q = 1}^{4} {(D_{q} - \overline{D})}^{2}}{4 {\overline{D}}^{2}} .

The smaller Var is, the more likely the frame contains a voiced sound. The same pitch_α may correspond to several 5-element combinations; the group with the smallest normalized variance is selected, and its mean spacing is recorded as D_α and its normalized variance as Var_α.
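The spacing test can be illustrated with a short Python sketch. The function names `normalized_variance` and `best_window` are hypothetical, and returning `None` when fewer than 5 elements are available is an assumption of this example.

```python
def normalized_variance(points):
    """Return (Var, mean spacing) for 5 consecutive elements P_t..P_{t+4}."""
    spacings = [b - a for a, b in zip(points, points[1:])]  # D1..D4
    mean = sum(spacings) / len(spacings)
    if mean == 0:
        return float("inf"), 0.0  # degenerate case: all points coincide
    var = sum((d - mean) ** 2 for d in spacings) / (len(spacings) * mean ** 2)
    return var, mean


def best_window(P):
    """Scan every run of 5 consecutive elements of P.

    Returns (Var_alpha, D_alpha) for the most uniform run,
    or None if P has fewer than 5 elements (assumption of this sketch).
    """
    best = None
    for t in range(len(P) - 4):
        var, mean = normalized_variance(P[t:t + 5])
        if best is None or var < best[0]:
            best = (var, mean)
    return best
```

For example, the spacings {4, 4, 5, 3} have mean 4 and normalized variance 2/64 = 0.03125, well above the threshold range quoted in step 306, whereas perfectly even spacings give a normalized variance of 0.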

Step 306: If Var_α < Var_Thd (where Var_Thd is an empirical value, 0.001 < Var_Thd < 0.003), the current frame is considered to possibly contain a voiced sound whose fundamental tone is D_α, where D_α represents the accurate fundamental frequency corresponding to this group of spacings, and the process enters step 307. If, in the set {P₁, P₂, P₃, …}, the spacings {D₁, D₂, D₃, D₄} of every 5-element combination fail to satisfy Var < Var_Thd, the process enters step 310.

Step 307: Determine whether D_α is very close to another accurate fundamental frequency already saved for the current frame. The procedure is as follows:

While pitch_α traverses all integers from [60/R] to [450/R], suppose that an earlier fundamental tone pitch_β also corresponded to a uniform extreme-value distribution, i.e. the values Var_α and Var_β corresponding to pitch_α and pitch_β are both smaller than Var_Thd of step 306; then Var_β and D_β have already been saved. Compare |D_α − D_β| with the threshold D_Pitch, where D_Pitch = D_Min/R and D_Min is an empirical value (50 < D_Min < 150) in Hz.

If |D_α − D_β| ≥ D_Pitch, enter step 308.

If |D_α − D_β| < D_Pitch, enter step 309.

Step 308: This step is entered when the normalized variance of the spacings corresponding to the current accurate fundamental frequency D_α is small enough and D_α is far enough from all other accurate fundamental frequencies. Record the current spacing mean D_α and normalized variance Var_α. Among all pitch_α from [60/R] to [450/R], if α = α₁, α₂, α₃, … all correspond to evenly distributed extreme values, their normalized variances and accurate fundamental frequencies are all retained, denoted VAR = {Var₁, Var₂, …} and D_All = {D₁, D₂, …}. The reason is that when the input is a voiced sound with a very low fundamental frequency, the 2nd or even 3rd multiple of the fundamental tone also lies within the range of human fundamental frequencies, i.e. within the search range of pitch_α, and its extreme-value spacing may be even more uniform. The data with the smallest normalized variance therefore does not necessarily correspond to the true fundamental tone. Common voiced-sound detection algorithms usually locate voiced sound by tracking the fundamental frequency, and so are easily disturbed by multiples of the fundamental tone and fail; this method keeps several groups of data and screens them at the end in combination with the preceding and following frames, which avoids the interference of fundamental-tone multiples.

Step 309: This step is entered when |D_α − D_β| < D_Pitch in step 307. Of two accurate fundamental frequencies that are close in frequency, only the one with the smaller normalized variance is kept. That is, if Var_α < Var_β, the records of D_β and Var_β are deleted from the sets VAR and D_All, and Var_α and D_α are recorded in them; otherwise, the records of D_β and Var_β are kept and the information for D_α is not recorded. The reason is that if two very close fundamental tones both appear possible, this is probably because pitch_α is imprecise owing to the limited spectral resolution; choosing the group with the more uniform extreme-value distribution keeps the fundamental tone closest to the actual one.
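The bookkeeping of steps 306 to 309 can be sketched roughly as follows. The function `update_candidates`, the pairing of D_All and VAR into one list of tuples, and the default threshold values are assumptions chosen for illustration; the sketch also compares against only the first close candidate it finds, which is a simplification.

```python
def update_candidates(candidates, d_alpha, var_alpha, var_thd=0.002, d_pitch=1.0):
    """Update the per-frame list of (D, Var) pairs (D_All and VAR kept together).

    var_thd: Var_Thd, assumed here to be 0.002 (the text gives 0.001..0.003).
    d_pitch: D_Pitch = D_Min / R, passed in directly as an assumed value.
    """
    if var_alpha >= var_thd:
        return candidates  # step 306 fails: spacing not uniform enough
    for idx, (d_beta, var_beta) in enumerate(candidates):
        if abs(d_alpha - d_beta) < d_pitch:
            # step 309: two close candidates, keep the more uniform one
            if var_alpha < var_beta:
                candidates[idx] = (d_alpha, var_alpha)
            return candidates
    # step 308: far enough from every saved candidate, so keep it as well
    candidates.append((d_alpha, var_alpha))
    return candidates


frame_candidates = []
update_candidates(frame_candidates, 8.0, 0.0010)
update_candidates(frame_candidates, 8.5, 0.0005)   # replaces the close 8.0 entry
update_candidates(frame_candidates, 16.0, 0.0015)  # kept: possibly a pitch multiple
update_candidates(frame_candidates, 20.0, 0.0100)  # rejected by step 306
```

Keeping both 8.5 and 16.0 in this hypothetical run mirrors the rationale of step 308: a multiple of the fundamental tone is retained rather than discarded, and the decision is deferred to the inter-frame test.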

Step 310: Determine whether all possible fundamental tones have been traversed (i.e. all integers in the range [60/R] ≤ pitch_α ≤ [450/R]). If yes, enter step 311; if no, return to step 304.

Step 311: Determine whether all frames in the buffer have been processed. After each frame has been tested against all fundamental tones in [60/R] ≤ pitch_α ≤ [450/R] (i.e. for each possible fundamental tone, the best-matching extreme values are searched among all current extreme values), what finally remains is VAR and D_All. Some frames have no satisfactory extreme-value distribution for any fundamental tone, so their VAR and D_All may both be empty sets. If the buffer still contains signal that has not gone through the extreme-value search and the test of possible fundamental tones in steps 302 to 310, return to step 302; if all frames in the buffer have been searched and tested, enter step 312.

Step 312: Begin combining information from several preceding and following frames to make a preliminary decision on whether harmonics exist. First, set the number t of the first frame to be tested to 1.

Step 313: Test whether frames t to t+3 can be connected into a harmonic, i.e. decide whether a voiced sound exists. The decision condition is: in 4 consecutive frames, if for every pair of adjacent frames the D_All of each frame contains at least one element very close to an element of the other, the 4 frames are considered to contain a harmonic. Specifically, two adjacent frames meet the requirement as long as each frame's D_All contains at least one element whose difference from some element of the other frame's D_All, divided by the smaller of the two values, does not exceed a constant ratio. The value of ratio ranges from 10% to 20%. Fig. 4 is a simplified schematic of the connection. As shown, among the 4 frames (frames t, t+1, t+2 and t+3), as long as every two consecutive frames are joined by a path, the 4 frames can be judged to be voiced.

Connection between adjacent frames is required because the fundamental frequency and harmonics of a voiced sound vary continuously and slowly, without abrupt changes between two frames. Thus, even if some segment shows evenly distributed energy extreme values because of interference, these cannot be connected continuously and the segment will not be judged as voiced. In addition, the multiples of the fundamental tone of some voiced sounds also have very small normalized variances, so the D_All of each frame may contain the fundamental frequency and its multiples. Therefore, as long as the fundamental frequency or one of its multiples can be connected, i.e. there is one connecting path between the two frames, the fundamental frequency and harmonics of the two frames are considered continuous; this prevents real voiced sounds from being missed.
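The connection test of step 313 might be sketched as follows. The names `frames_connected` and `is_voiced_run`, the 0-based frame index, and the default ratio of 0.15 (inside the stated 10% to 20% range) are assumptions of this example.

```python
def frames_connected(d_all_a, d_all_b, ratio=0.15):
    """True if some element of one frame's D_All is close to some element
    of the other's: |x - y| / min(x, y) <= ratio."""
    return any(
        abs(x - y) <= ratio * min(x, y)
        for x in d_all_a
        for y in d_all_b
    )


def is_voiced_run(d_all_frames, t, ratio=0.15):
    """True if frames t..t+3 (0-based here) are each connected to their neighbor."""
    window = d_all_frames[t:t + 4]
    if len(window) < 4:
        return False
    return all(frames_connected(a, b, ratio) for a, b in zip(window, window[1:]))


# Hypothetical D_All sets for four consecutive frames
frames = [[8.0, 16.0], [8.4], [17.0, 8.8], [9.0]]
voiced = is_voiced_run(frames, 0)
```

A frame whose D_All is empty can never be connected, so an interference burst that lacks a matching neighbor breaks the chain, as the paragraph above explains.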

Step 314: Increase the start frame number t of the harmonic test by 1, and continue to test whether the next 4 frames are connected.

Step 315: Determine whether the harmonic connection test has been done for every group of 4 consecutive frames in the buffer; that is, if the 4 consecutive frames from frame L−3 to frame L have been tested, enter step 316; otherwise return to step 313.

Step 316: Because voiced sound varies continuously and slowly, and the main purpose of the endpoint detection algorithm is to detect speech (i.e. both unvoiced and voiced sound), when two voiced segments are close to each other, the segment between them can also be regarded as speech, and is very likely voiced. Therefore, the surrounding information is used to further connect and shape the harmonics already detected in the buffer: every non-harmonic segment that lies between two voiced segments in the buffer and is shorter than 60 milliseconds is judged as voiced.
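A minimal sketch of this shaping step is given below. The function name `fill_short_gaps` and the parameter `frame_ms` (the time advance per frame, which the text does not state here) are assumptions of the example.

```python
def fill_short_gaps(voiced, frame_ms, max_gap_ms=60):
    """Mark as voiced every non-voiced gap that lies between two voiced
    segments and lasts less than max_gap_ms.

    voiced:   list of booleans, one per frame in the buffer.
    frame_ms: assumed time advance per frame in milliseconds.
    """
    out = list(voiced)
    last_voiced = None
    for i, is_voiced in enumerate(voiced):
        if not is_voiced:
            continue
        if last_voiced is not None:
            gap = i - last_voiced - 1
            if gap > 0 and gap * frame_ms < max_gap_ms:
                for j in range(last_voiced + 1, i):
                    out[j] = True
        last_voiced = i
    return out


flags = [True, True, False, False, True] + [False] * 7 + [True]
shaped = fill_short_gaps(flags, frame_ms=10)
```

In this hypothetical buffer the 20 ms gap is filled, while the 70 ms gap is left unchanged because it is not shorter than 60 milliseconds.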

Step 317: Output the voiced-sound detection result for the current buffer; the harmonic detection process ends.

Claims

1. A speech endpoint detection method based on energy and harmonics, characterized by comprising the following steps:

1) dividing the input digitized sound signal into frames and applying windowing and pre-emphasis, each frame containing F sampling points, where F is an integer;

2) using a buffer to store L frames of signal, where L is an integer greater than 25 and the duration of the L frames is greater than 200 milliseconds;

31) estimating the noise energy from the first r frames in the buffer, where 7 < r < 13 and r is an integer; with E_i denoting the energy of the i-th frame, computing the mean energy of the first r frames

\overline{E} = \frac{1}{r} \sum_{i = 1}^{r} E_{i}

and the mean square deviation

σ = {(\frac{1}{r} \sum_{i = 1}^{r} {(E_{i} - \overline{E})}^{2})}^{1 / 2},

and normalizing the mean square deviation, d = σ/\overline{E}, with \overline{E} and d taken as the initial estimate of the noise distribution;
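The initial noise estimate can be illustrated with a few lines of Python. The function name `noise_estimate` and the default r = 10 (one of the values allowed by 7 < r < 13) are assumptions of this sketch.

```python
import math


def noise_estimate(energies, r=10):
    """Return (mean energy, normalized mean square deviation d) of the
    first r frame energies in the buffer."""
    head = energies[:r]
    if not head:
        raise ValueError("at least one frame energy is required")
    mean = sum(head) / len(head)
    sigma = math.sqrt(sum((e - mean) ** 2 for e in head) / len(head))
    d = sigma / mean if mean > 0 else 0.0  # guard against an all-zero buffer
    return mean, d


mean_e, d = noise_estimate([2.0, 4.0] * 5)
```

For the hypothetical energies above, the mean is 3.0 and σ is 1.0, so d is 1/3.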

33) after the energy threshold Thd is determined, performing energy detection: if the buffer contains t_{Start_energy} milliseconds of continuous signal whose mean energy is greater than Thd, marking the 1st frame of this continuous segment as T_Start and triggering harmonic detection, where t_{Start_energy} is a constant in milliseconds with 80 < t_{Start_energy} < 150; if no signal segment in the buffer satisfies the above condition, inputting a new frame, updating the buffer and the threshold Thd, and searching for the speech starting point in the updated buffer;
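The energy test of this step can be sketched as a sliding-window search. The function name `find_energy_start`, the parameter `frame_ms` (assumed duration contributed by each frame), and the default of 100 ms for t_{Start_energy} are assumptions; the threshold Thd is simply passed in.

```python
import math


def find_energy_start(energies, thd, frame_ms, t_start_energy_ms=100):
    """Return the 0-based index of the first frame of the earliest segment of
    t_start_energy_ms whose mean energy exceeds thd, or None if there is none."""
    n = max(1, math.ceil(t_start_energy_ms / frame_ms))  # frames per segment
    for start in range(len(energies) - n + 1):
        if sum(energies[start:start + n]) / n > thd:
            return start  # candidate T_Start
    return None


energies = [1.0] * 5 + [10.0] * 12
t_start = find_energy_start(energies, thd=6.0, frame_ms=10)
```

Because the criterion is a mean over the segment, the marked frame can precede the actual energy rise by a few frames, which is one reason a later step refines the starting point.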

4) after the speech starting point T_Start has been preliminarily determined by energy detection, searching the current buffer for signal with the harmonic characteristics of voiced sound; if voiced sound is found, speech is considered present in the buffer and step 5) is entered to search further for the precise speech starting point; if no voiced sound is found, deleting the T_Start mark, updating the buffer and the energy threshold Thd, and returning to step 3);

5) searching precisely for the speech starting point in the current buffer, comprising: searching frame by frame forward from T_Start for a frame satisfying the following condition: starting from this frame, the signal energy increases frame by frame; the frame preceding this frame is then judged to be the speech starting point;

after step 5) is completed, detecting the speech end point from the energy of the signal in the buffer, the detection process being: if the buffer contains a continuous segment of length t_{End_energy} milliseconds, where t_{End_energy} is a constant in milliseconds with 60 < t_{End_energy} < 120, in which the energy of every frame is greater than the end-point energy threshold Thd_End, updating the buffer; continuing to update the buffer and the threshold Thd_End until no such segment can be found in the buffer, whereupon the first frame in the buffer is judged to be the speech end point.

2. The speech endpoint detection method based on energy and harmonics according to claim 1, characterized in that, in step 3) and step 4), the process of updating the energy threshold comprises: computing the mean energy of the first r frames in the buffer

\overline{E_{new}} = \frac{1}{r} \sum_{i = 1}^{r} E_{i}

and the normalized mean square deviation

d_{new} = σ_{new} / \overline{E_{new}} = {(\frac{1}{r} \sum_{i = 1}^{r} {(E_{i} - \overline{E_{new}})}^{2})}^{1 / 2} / \overline{E_{new}},

and updating E and d with the following logic:

c) if conditions a) and b) are not satisfied, but E_new < 10·E, d_new < 1 and d_new < 1.5·d are satisfied, then E = (E_new + E)/2 and d = (d_new + d)/2; this strategy prevents the common noise-update "deadlock" and helps track, as soon as possible, step noise whose energy rises abruptly;

d) if none of the above conditions is satisfied, the r frames are not stationary noise, so no noise update is performed.

3. The speech endpoint detection method based on energy and harmonics according to claim 1, characterized in that the end-point energy threshold Thd_End is determined by the noise energy mean E and the normalized mean square deviation d, expressed as Thd_End = E + E·d; each time the buffer is updated, the method likewise uses the mean energy of the first r frames

\overline{E_{new}} = \frac{1}{r} \sum_{i = 1}^{r} E_{i}

and the normalized mean square deviation

d_{new} = σ_{new} / \overline{E_{new}} = {(\frac{1}{r} \sum_{i = 1}^{r} {(E_{i} - \overline{E_{new}})}^{2})}^{1 / 2} / \overline{E_{new}} .

4. The speech endpoint detection method based on energy and harmonics according to claim 1, characterized in that, in step 33), each time a new frame is input the buffer is updated once and the threshold Thd is updated accordingly, until T_Start is found; the buffer is updated as follows: the content of the 1st frame in the buffer is discarded, the contents of the 2nd to L-th frames are moved forward by one frame, and the new frame is stored as the L-th frame of the buffer.
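The buffer update described in this claim behaves like a fixed-length queue. A minimal Python sketch, assuming the buffer is held in a `collections.deque` with `maxlen=L` (the helper names `make_buffer` and `push_frame` are hypothetical):

```python
from collections import deque


def make_buffer(frames, L):
    """Create a buffer that holds at most L frames."""
    return deque(frames, maxlen=L)


def push_frame(buffer, new_frame):
    """Discard frame 1, shift frames 2..L forward by one, store the new frame
    as frame L. A deque with maxlen does the discard and shift automatically."""
    buffer.append(new_frame)
    return buffer


buf = make_buffer(["f1", "f2", "f3"], L=3)
push_frame(buf, "f4")
```

After the call the buffer holds f2, f3 and f4, so the oldest frame has been dropped without any explicit copying.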