CN100580770C - Voice end detection method based on energy and harmonic - Google Patents


Publication number
CN100580770C
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200510089957A
Other languages
Chinese (zh)
Other versions
CN1912993A (en)
Inventor
国雁萌
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Application filed by Institute of Acoustics CAS and Beijing Kexin Technology Co Ltd
Priority to CN200510089957A
Publication of CN1912993A
Application granted
Publication of CN100580770C
Legal status: Expired - Fee Related

Abstract

The invention relates to a speech endpoint detection method based on energy and harmonics, comprising the following steps: the digitized sound signal is pre-processed and stored in a buffer; each time a new frame is input, the buffer is shifted and updated and the threshold is adjusted, and a preliminary speech start point $T_{start}$ is determined from the signal energy; the buffer is then searched for signal with the harmonic characteristics of voiced sound. If voiced sound is found, the precise speech start point is searched for; otherwise the energy-based search for the speech start point continues. Once the start point is found, the speech end point is searched for according to the signal energy. Advantages: the method adjusts its sensitivity according to the noise strength, so it adapts to the signal-to-noise ratio of the input signal, and the range of energy detection is wide, so low-energy speech is not missed.

Description

Speech endpoint detection method based on energy and harmonics
Technical field
The present invention relates to the field of automatic speech recognition, and in particular to a speech endpoint detection method.
Background art
The input to an automatic speech recognition system is usually noisy speech. To keep signal segments that contain no speech from entering the recognizer, and thereby preserve system performance and reduce computational cost, the start point and end point of the user's speech must be detected in the signal. This process is called endpoint detection.
Common endpoint detection algorithms fall into two classes: rule-based and model-based. Rule-based methods generally use features such as signal energy, zero-crossing rate, cepstrum, and spectral estimates to compute a long-term distance, and decide whether speech is present by comparing the distance against thresholds and applying logical operations. Model-based methods generally build separate models of the statistical properties of noise and speech, and then decide according to likelihood.
The application environments of automatic speech recognition systems are complex, so endpoint detection must be reliable and widely adaptable. This includes adapting to slowly varying noise of various intensities and types, adapting quickly to changes in noise intensity and type, remaining unaffected by brief strong noise, and keeping accuracy, computational efficiency, and latency stable in all cases. In complex noise environments, however, the noise itself has no fixed characteristics, and speech features are often masked by noise. Model-based methods are often suitable only for specific environments, while rule-based methods built on a single feature are both easily limited by the signal-to-noise ratio and sensitive to changes in noise type. Many researchers have therefore combined several features to improve the reliability of endpoint detection. For example, the method proposed in 2002 by Sahar E. Bou-Ghazale et al., which performs endpoint detection using the additive relation of energy together with cepstral features, works stably in stationary noise, but it is very sensitive to abrupt changes in noise energy, and because the cepstral features of speech are affected by noise, it cannot be used at low signal-to-noise ratios. Another example is the endpoint detection method proposed in 2003 by Arnaud Martin, which combines signal energy with voicing features. Its shortcomings are that it has difficulty tracking abrupt changes in noise energy; that it must first detect the fundamental frequency of the signal before it can identify voiced sound, so it is easily disturbed by pitch multiples; and that it uses a comb filter in the frequency domain, which makes voiced sound hard to find in complex noise.
Summary of the invention
The object of the present invention is to provide an endpoint detection method that combines signal energy with the voiced-sound harmonic characteristics of speech. The method suits most speech and noise environments: it works stably under many noise types and intensities, and it adapts quickly to sudden changes in noise type and intensity.
To achieve the above object, the speech endpoint detection method based on energy and harmonics provided by the invention comprises the following steps:
1) The input digitized sound signal is divided into frames and pre-processed (for example, windowing and pre-emphasis). Each frame contains F sampling points and is about 25 milliseconds long; adjacent frames overlap by about 15 milliseconds;
2) A buffer is used to store L frames of signal (L > 25; the L frames span more than 200 milliseconds);
3) A preliminary detection of the speech start point is made from the energy of the signal in the buffer, comprising:
31) The noise energy is estimated from the first r frames in the buffer, where 7 < r < 13. Let $E_i$ denote the energy of the i-th frame. Compute the mean energy of the first r frames, $\bar{E} = \frac{1}{r}\sum_{i=1}^{r} E_i$, and the standard deviation, $\sigma = \left(\frac{1}{r}\sum_{i=1}^{r}(E_i - \bar{E})^2\right)^{1/2}$, and normalize the standard deviation, $d = \sigma/\bar{E}$. $\bar{E}$ and $d$ serve as the initial estimate of the noise distribution;
32) The energy threshold $Thd$ is set from the noise energy as $Thd = \bar{E} + \sigma/\alpha = \bar{E} + \bar{E}d/\alpha$, where $\alpha$ is a sensitivity coefficient, $0 < \alpha < 1$;
33) Once $Thd$ is determined, energy detection is performed. If the buffer contains a continuous signal segment of $t_{start\_energy}$ milliseconds whose mean energy is greater than $Thd$, the first frame of that segment is marked as $T_{start}$ and harmonic detection is triggered; $t_{start\_energy}$ is a constant in milliseconds, $80 < t_{start\_energy} < 150$. If no segment in the buffer satisfies this condition, a new frame is input, the buffer and the threshold $Thd$ are updated, and the speech start point is searched for in the updated buffer. Each time a new frame is input, the buffer is updated once and $Thd$ is updated with it, until $T_{start}$ is found. The buffer is updated as follows: the content of frame 1 is discarded, the contents of frames 2 to L are shifted forward by one frame, and the new frame is stored in position L;
4) After energy detection has made a preliminary decision on the speech start point $T_{start}$, the current buffer is searched for signal with voiced-sound harmonic characteristics. If voiced sound is found, speech is deemed present in the buffer and step 5) refines the start point; if no voiced sound is found, the $T_{start}$ mark is deleted and the method returns to step 3);
5) The speech start point is searched for precisely in the current buffer: starting from $T_{start}$, search forward frame by frame for a frame satisfying the following condition: starting from this frame, the signal energy increases frame by frame. The frame immediately preceding it is then judged to be the speech start point.
In the above technical scheme, after step 5) is completed, the speech end point is detected from the energy of the signal in the buffer. The process is: if the buffer contains a continuous signal segment of length $t_{end\_energy}$ milliseconds ($t_{end\_energy}$ is a constant in milliseconds, $60 < t_{end\_energy} < 120$) in which the energy of every frame is greater than the end-point energy threshold $Thd_{end}$, the buffer is updated. The buffer and the threshold $Thd_{end}$ keep being updated until no such segment can be found in the buffer, at which point the first frame in the buffer is judged to be the speech end point.
In the above technical scheme, the energy threshold in steps 3) and 4) is updated as follows: compute the mean energy of the first r frames in the buffer, $\bar{E}_{new} = \frac{1}{r}\sum_{i=1}^{r} E_i$, and the normalized standard deviation, $d_{new} = \sigma_{new}/\bar{E}_{new} = \left(\frac{1}{r}\sum_{i=1}^{r}(E_i - \bar{E}_{new})^2\right)^{1/2} / \bar{E}_{new}$, then update $\bar{E}$ and $d$ with the following logic:
a) If $\bar{E}_{new} < \bar{E}\,c_{update}$ and $d_{new} < d\,c_{update}$, then $\bar{E} = \bar{E}_{new}$ and $d = d_{new}$, where $c_{update}$ is the noise update constant, $1 < c_{update} < 1.5$. This rule lets the system track stationary noise at all times;
b) If condition a) is not met but $\bar{E}_{new} < 10\bar{E}$ and $d_{new} < d$, then $\bar{E} = \bar{E}_{new}$ and $d = d_{new}$. This rule helps the system adapt quickly to noise with step changes;
c) If neither a) nor b) is met but $\bar{E}_{new} < 10\bar{E}$, $d_{new} < 1$, and $d_{new} < 1.5d$, then $\bar{E} = (\bar{E}_{new} + \bar{E})/2$ and $d = (d_{new} + d)/2$. This rule prevents the common "deadlock" of noise updating and helps track step noise whose energy jumps upward;
d) If none of the above conditions is met, the r frames are not stationary noise, so no noise update is performed.
After $\bar{E}$ and $d$ have been updated, the new threshold is computed as $Thd = \bar{E} + \bar{E}d/\alpha$.
In the above technical scheme, the end-point energy threshold $Thd_{end}$ is determined by the noise energy mean $\bar{E}$ and the normalized standard deviation $d$: $Thd_{end} = \bar{E} + \bar{E}d$. Each time the buffer is updated, the mean energy of the first r frames, $\bar{E}_{new} = \frac{1}{r}\sum_{i=1}^{r} E_i$, and the normalized standard deviation, $d_{new} = \sigma_{new}/\bar{E}_{new} = \left(\frac{1}{r}\sum_{i=1}^{r}(E_i - \bar{E}_{new})^2\right)^{1/2} / \bar{E}_{new}$, are likewise used to update $\bar{E}$ and $d$. The update logic is: if $\bar{E}_{new} \le \bar{E}$ and $d_{new} \le d$, then $\bar{E} = (\bar{E}_{new} + \bar{E})/2$ and $d = (d + d_{new})/2$.
The advantages of the invention are as follows. First, a sensitivity coefficient is introduced when energy detection is initialized, so the precision is adjusted automatically according to the noise intensity. Second, because energy detection precedes voiced-sound harmonic detection in this method and the two are combined with a logical AND, the energy threshold can be set low: most noise is still filtered out, yet speech is not misjudged as noise, which prevents the low-energy part at the speech onset from being cut off. In addition, because a buffer is used, the noise estimate is updated with every input frame, and the threshold update strategy allows abrupt changes in noise energy, the method adapts quickly to noise changes and is therefore more robust to noise. The invention adapts to some extent to the intensity of the system input signal, and it places no restriction on the speaker's age, sex, or speaking habits. A particular advantage is its strong adaptability to step changes in channel noise in telecommunication systems.
Description of drawings
Fig. 1 is a flow chart of the principle of the invention;
Fig. 2 shows the time-domain waveform and spectrogram of a segment of speech corrupted by noise;
Fig. 3 is a flow chart of harmonic detection;
Fig. 4 is a schematic diagram of the harmonic decision over 4 consecutive frames.
Embodiment
The principle of the invention is as follows:
First, adaptive energy detection gives a preliminary speech start point $T_{start}$. Harmonic detection follows: if the harmonic structure of voiced sound is found in the buffer, voiced sound is judged to be present and the precise start point is searched for near $T_{start}$; if no harmonic structure is found, the energy-based search for the start point continues. After the start point is found, the speech end point is searched for by adaptive energy detection.
Because speech and noise add in energy, the appearance of speech makes the signal energy rise suddenly when the noise energy changes slowly. Energy detection continuously tracks slowly varying noise, finds regions where the energy jumps, and marks them as positions where speech may occur. Burst noise can also change the signal energy, however, and may cause energy detection to misjudge, so the invention uses harmonic detection as a further check to remove these errors.
Harmonics are a salient feature of voiced sound; in a spectrogram they appear as evenly spaced bright bands whose spacing is the fundamental frequency. Speech is made up of unvoiced and voiced sounds, and voiced sounds have far greater energy and duration than unvoiced sounds, so any meaningful speech segment must contain fairly long voiced sound. Because voiced energy is concentrated mainly at the fundamental frequency and its harmonics, noisy voiced sound still retains some clear harmonics even when the noise is strong (as shown in Fig. 2). If a long signal segment (> 200 milliseconds) contains no voiced harmonic structure at all, the current signal can be considered to contain no speech.
The invention is further described below with reference to the drawings and a preferred embodiment.
Embodiment:
The speech endpoint detection method based on energy and harmonics provided by the invention comprises 3 basic steps: energy detection, harmonic detection, and speech end-point detection. The input is a digitized sound signal divided into frames of equal length (each frame contains F sampling points and lasts about 25 milliseconds) that overlap by about 15 milliseconds, and an L-frame buffer is used (L > 25; the L frames span more than 200 milliseconds). The workflow of each step is described below.
As shown in Fig. 1, the invention comprises the following steps:
Step 100: Set up the L-frame buffer, store the first L frames of the input signal in it, and begin endpoint detection. Each time a frame is input, the buffer shifts and updates automatically.
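The shift-and-update behaviour of the buffer can be sketched with a fixed-length queue. This is only an illustration: the names `make_buffer` and `push_frame` are hypothetical, and the tiny value of L is chosen for readability (the text requires L > 25).

```python
from collections import deque

def make_buffer(first_frames, L):
    """Create an L-frame buffer holding the first L input frames."""
    return deque(first_frames[:L], maxlen=L)

def push_frame(buffer, new_frame):
    """Drop frame 1, shift frames 2..L forward, store the new frame at position L."""
    buffer.append(new_frame)  # a deque with maxlen discards its oldest item itself
    return buffer

# Illustrative use with L = 4 (assumed value, far smaller than in practice)
buf = make_buffer(["f1", "f2", "f3", "f4"], L=4)
push_frame(buf, "f5")
```

A bounded deque avoids copying L-1 frames on every update, which a literal "move each frame forward" loop would do.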
Step 200: Preliminary detection of the speech start point from energy. A signal energy threshold $Thd$ is set according to the noise conditions, and the energy of the signal in the buffer is used to judge whether the current buffer contains speech. If so, the speech start point $T_{start}$ is preliminarily determined and step 300 follows; if not, step 700 follows. This step is implemented as follows:
Each of the L frames of digital sound signal in the buffer is pre-processed (depending on the system, this may include windowing, pre-emphasis, and so on). Let each frame be F points long. The frame is first zero-padded to N points (where $N \ge F$, $N = 2^x$, x is an integer and $x \ge 8$), and an N-point discrete Fourier transform gives the discrete spectrum $X(i,bin) = \sum_{n=0}^{N-1} x(i,n)\,e^{-j(2\pi/N)\,n\cdot bin}$, where $x(i,n)$ is the n-th sampling point of the i-th frame in the buffer and $X(i,bin)$ is the bin-th Fourier transform value of the i-th frame ($bin = 0, 1, \ldots, N-1$). Passing the spectrum through a bank of filters uniformly distributed on the Mel scale (for example, p triangular-window filters, with adjacent filters overlapping by half) gives the energy of each sub-band, $E_i(j) = \ln\left\{\sum_{bin=L_j}^{H_j} T_j(bin)\,|X(i,bin)|^2\right\}$ ($j = 0, 1, 2, \ldots, p-1$).
Here j is the index of the sub-band filter, p is the number of sub-band filters, $T_j(bin)$ is the frequency response of the j-th filter, and $L_j$ and $H_j$ are the start frequency and cutoff frequency of the j-th filter. Finally, summing the sub-band energies gives the energy of the i-th frame, $E_i = \sum_{j=0}^{p-1} E_i(j)$.
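The frame-energy formula above can be sketched as follows. This is a minimal illustration, assuming the power spectrum $|X(i,bin)|^2$ has already been computed and that each filter is supplied as a tuple of its first bin, last bin, and response values; the function name and the tuple layout are hypothetical, and no particular Mel filter design is implied.

```python
import math

def frame_energy(power, filters):
    """Sum of log sub-band energies: E_i = sum_j ln(sum_bin T_j(bin) * |X(i,bin)|^2).

    power   -- |X(i,bin)|^2 for one frame, indexed by bin
    filters -- list of (L_j, H_j, T_j), where T_j[k] is the response at bin L_j + k
    """
    total = 0.0
    for lo, hi, response in filters:
        band = sum(response[b - lo] * power[b] for b in range(lo, hi + 1))
        total += math.log(band)  # assumes the weighted band energy is positive
    return total
```

Because each sub-band is logged before summing, a single loud band contributes less to $E_i$ than it would in a plain sum of energies.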
The first r frames of the input signal (generally 7 < r < 13) are used to initialize energy detection. Because the user always turns the system on before starting to speak, these r frames can be assumed to contain only noise. Compute their mean energy, $\bar{E} = \frac{1}{r}\sum_{i=1}^{r} E_i$, and standard deviation, $\sigma = \left(\frac{1}{r}\sum_{i=1}^{r}(E_i - \bar{E})^2\right)^{1/2}$, and normalize the standard deviation, $d = \sigma/\bar{E}$. $\bar{E}$ and $d$ then serve as the initial estimate of the noise distribution.
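As a minimal sketch of this initialization, the mean and the normalized standard deviation of the first r frame energies can be computed as below. The function name `noise_stats` is hypothetical, and the sample energies in the usage note are made up for illustration.

```python
import math

def noise_stats(energies, r):
    """Return (mean, d) of the first r frame energies, where d = sigma / mean."""
    head = energies[:r]
    mean = sum(head) / r
    sigma = math.sqrt(sum((e - mean) ** 2 for e in head) / r)  # population form, 1/r
    return mean, sigma / mean
```

For example, with the eight energies 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so d is 0.4.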
The energy threshold is computed as $Thd = \bar{E} + \sigma/\alpha = \bar{E} + \bar{E}d/\alpha$, where $\alpha$ is the sensitivity coefficient ($0 < \alpha < 1$). The larger $\alpha$ is, the more sensitive energy detection becomes, that is, the more easily noise is falsely detected as speech; conversely, the smaller $\alpha$ is, the more easily speech is misjudged as noise. $\alpha$ is determined as follows: if $\bar{E} \ge E_x$, then $\alpha = \alpha_{max}$; if $\bar{E} < E_x$, then $\alpha = \alpha_{min}$, where $\alpha_{min} < \alpha_{max}$. $\alpha_{min}$, $\alpha_{max}$, and $E_x$ can all be set in advance according to the energy range of the recognizer's input signal and do not depend on the environment of use. In this way, detection sensitivity is raised automatically when the environmental noise is strong, to avoid missing speech, and lowered when the environmental noise is weak, to prevent excessive false alarms. In addition, an allowed interval for $Thd$ is set according to the input range of the speech recognizer, and the value is limited to that interval.
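The threshold rule can be sketched as a single function. All parameter names here (`e_x`, `alpha_min`, `alpha_max`, `thd_lo`, `thd_hi`) are hypothetical stand-ins for the preset constants described in the text, and the values in the usage note are arbitrary.

```python
def energy_threshold(mean, d, e_x, alpha_min, alpha_max, thd_lo, thd_hi):
    """Thd = mean + mean * d / alpha, with alpha chosen by noise level and Thd clamped."""
    # Strong noise (mean >= e_x) selects the larger alpha, i.e. higher sensitivity
    alpha = alpha_max if mean >= e_x else alpha_min
    thd = mean + mean * d / alpha
    # Keep the threshold inside the interval allowed by the recognizer's input range
    return min(max(thd, thd_lo), thd_hi)
```

With mean 5, d 0.4, and alpha 0.5 the unclamped threshold is 5 + 5 * 0.4 / 0.5 = 9.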
Once the energy threshold $Thd$ is determined, energy detection is performed as follows. If the current buffer contains a continuous signal segment of $t_{start\_energy}$ milliseconds ($t_{start\_energy}$ is a constant, $80 < t_{start\_energy} < 150$) whose mean energy is greater than the threshold $Thd$, the first frame of that segment is marked as $T_{start}$ and harmonic detection is triggered (step 300). Otherwise, if no segment in the buffer satisfies this condition, the content of frame 1 in the buffer is discarded, the contents of frames 2 to L are shifted forward by one frame, the new frame is stored in position L, and the energy threshold is updated from the first r frames of the updated buffer (step 700). Each time a new frame is input, the buffer and the threshold are updated, until $T_{start}$ is found.
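The window test in this step can be sketched as below. It is an illustration under one assumption not stated in the text: the duration $t_{start\_energy}$ is expressed as a whole number of frames `n_frames` (with a frame shift of about 10 milliseconds, 80 to 150 milliseconds would correspond to roughly 8 to 15 frames). The function name is hypothetical.

```python
def find_start(energies, thd, n_frames):
    """Return the index of the first frame of the first n_frames-long window
    whose mean energy exceeds thd, or None if no window qualifies."""
    for first in range(len(energies) - n_frames + 1):
        window = energies[first:first + n_frames]
        if sum(window) / n_frames > thd:
            return first  # this frame would be marked T_start
    return None
```

A `None` result corresponds to the branch in which a new frame is read and the buffer and threshold are updated before searching again.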
Step 300: Harmonic detection is performed on the signal in the buffer to judge whether it contains voiced sound. If so, step 400 follows; if not, step 700 follows. The implementation of this step is described in detail later.
Step 400: Precise search for the speech start point in the buffer. The main operation is: starting from $T_{start}$, search forward frame by frame for a frame satisfying the following condition: starting from this frame, the signal energy increases frame by frame. The frame immediately preceding it is then judged to be the speech start point.
Step 700: Update the threshold according to the detection result and the current buffer. Updating the threshold is the main way the invention tracks noise and adapts to environmental change. Specifically, each time a frame is input, the buffer shifts by one frame. Compute the mean energy of the first r frames in the buffer, $\bar{E}_{new} = \frac{1}{r}\sum_{i=1}^{r} E_i$, and the normalized standard deviation, $d_{new} = \sigma_{new}/\bar{E}_{new} = \left(\frac{1}{r}\sum_{i=1}^{r}(E_i - \bar{E}_{new})^2\right)^{1/2} / \bar{E}_{new}$, and update $\bar{E}$ and $d$ with the following logic, according to the situation:
1. If the first r frames of the current buffer are stationary noise with properties similar to the previous noise, their mean and normalized standard deviation change little. The corresponding rule is: if $\bar{E}_{new} < \bar{E}\,c_{update}$ and $d_{new} < d\,c_{update}$, then $\bar{E} = \bar{E}_{new}$ and $d = d_{new}$, where $c_{update}$ is the noise update constant, $1 < c_{update} < 1.5$.
2. If the above condition is not met but $\bar{E}_{new} < 10\bar{E}$ and $d_{new} < d$, then $\bar{E} = \bar{E}_{new}$ and $d = d_{new}$. In this case the r frames may be stationary noise that differs somewhat from the previous noise, so $\bar{E}_{new}$ differs from $\bar{E}$, but $d_{new}$ can still be small and does not noticeably exceed $d$. For step noise whose energy rises suddenly and stays high, this rule also helps the system track its energy quickly;
3. If neither condition 1 nor condition 2 is met but $\bar{E}_{new} < 10\bar{E}$, $d_{new} < 1$, and $d_{new} < 1.5d$, then $\bar{E} = (\bar{E}_{new} + \bar{E})/2$ and $d = (d_{new} + d)/2$. This rule prevents the common "deadlock" of noise updating and likewise helps track step noise quickly.
4. In all other cases the current r frames contain non-stationary noise, and no noise update is performed. If these r frames happen to contain the switching point between two segments of stationary noise whose energies differ considerably, then as the buffer is progressively updated, the first r frames will gradually come to contain only the new stationary noise and will satisfy case 2 or case 3. In this way the system tracks stationary noise with a delay of at most r frames.
From the updated mean and normalized standard deviation, the new threshold is computed as $Thd = \bar{E} + \bar{E}d/\alpha$, and step 200 is entered to perform energy detection on the current buffer.
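The four update cases of step 700 can be sketched as a cascade of tests. This is an illustration, not a definitive implementation: the function names are hypothetical, the default `c_update=1.2` is an arbitrary value inside the stated range, and the factors 10 and 1.5 reflect a reading of the source conditions as "less than 10 times the mean" and "less than 1.5 times d".

```python
def update_noise(E, d, E_new, d_new, c_update=1.2):
    """Apply cases 1-4 of the noise update and return the new (E, d)."""
    if E_new < E * c_update and d_new < d * c_update:
        return E_new, d_new                      # case 1: similar stationary noise
    if E_new < E * 10 and d_new < d:
        return E_new, d_new                      # case 2: stationary, different level
    if E_new < E * 10 and d_new < 1 and d_new < d * 1.5:
        return (E_new + E) / 2, (d_new + d) / 2  # case 3: average to avoid deadlock
    return E, d                                  # case 4: non-stationary, no update

def threshold(E, d, alpha):
    """Thd = E + E * d / alpha, recomputed after every update."""
    return E + E * d / alpha
```

Ordering the tests this way mirrors the text: each later case applies only when the earlier ones have failed.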
Step 500: After the speech start point has been detected, the speech end point is detected from the energy of the signal in the buffer. The process is: if the buffer contains a continuous signal segment of length $t_{end\_energy}$ milliseconds ($t_{end\_energy}$ is a constant, $60 < t_{end\_energy} < 120$) in which the energy of every frame is greater than the end-point energy threshold $Thd_{end}$, the buffer and the threshold are updated. The buffer and $Thd_{end}$ keep being updated until no such segment exists in the buffer, at which point the first frame of the buffer is judged to be the speech end point.
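The test used in step 500 differs from the start-point test: every frame of the segment must exceed the threshold, not just the segment mean. A minimal sketch follows, assuming $t_{end\_energy}$ has been converted to a frame count `n_frames`; the function name is hypothetical.

```python
def has_speech_run(energies, thd_end, n_frames):
    """True if some run of n_frames consecutive frames all have energy above thd_end."""
    run = 0
    for e in energies:
        run = run + 1 if e > thd_end else 0  # any low frame resets the run
        if run >= n_frames:
            return True
    return False
```

While this returns True the buffer keeps shifting; the first time it returns False, the first frame in the buffer would be taken as the speech end point.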
Because energy detection requires little computation, judging the speech end point by energy keeps the system efficient. In addition, the algorithm keeps updating the noise estimate during the speech segment, so when the environmental noise weakens, the threshold falls with it. This helps prevent the end point from being declared too early and reduces recognition errors.
The energy threshold $Thd_{end}$ is determined by the noise energy mean $\bar{E}$ and the normalized standard deviation $d$: $Thd_{end} = \bar{E} + \bar{E}d$. Each time the buffer shifts by one frame, the mean energy $\bar{E}_{new} = \frac{1}{r}\sum_{i=1}^{r} E_i$ and the normalized standard deviation $d_{new} = \sigma_{new}/\bar{E}_{new} = \left(\frac{1}{r}\sum_{i=1}^{r}(E_i - \bar{E}_{new})^2\right)^{1/2} / \bar{E}_{new}$ are likewise used to update $\bar{E}$ and $d$. The update logic is: if $\bar{E}_{new} \le \bar{E}$ and $d_{new} \le d$, then $\bar{E} = (\bar{E}_{new} + \bar{E})/2$ and $d = (d + d_{new})/2$. The energy threshold is thus updated only when lower or more stationary noise is detected, which ensures that speech with steady energy is not misjudged as noise. For a speech recognition system, cutting off the low-energy part at the tail of the speech may cause misrecognition or rejection. Noise keeps changing while speech is present; if the noise estimate from start-point detection were still used, then after the noise weakens the low-energy tail of the speech might be mistaken for noise, harming the integrity of the speech and lowering the recognition rate. Because this update logic keeps detecting new noise during the speech segment, it prevents the problem and keeps the speech segment complete.
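The end-point update differs from the start-point update in that the estimate may only move downward. A minimal sketch, with a hypothetical function name:

```python
def update_end_noise(E, d, E_new, d_new):
    """Update (E, d) only when the new estimate is no larger in both mean and
    normalized deviation, then return (E, d, Thd_end) with Thd_end = E + E * d."""
    if E_new <= E and d_new <= d:
        E, d = (E_new + E) / 2, (d + d_new) / 2
    return E, d, E + E * d
```

Averaging rather than replacing the estimate makes the threshold fall gradually, so a brief quiet stretch inside an utterance does not pull it down all at once.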
Step 600: Output the speech start point and end point; the endpoint detection process ends.
The implementation of step 300 in this embodiment is described in detail below.
After energy detection has made a preliminary decision on the speech start point $T_{start}$, this step uses several properties of voiced-sound harmonics to search the current buffer for signal frames that contain voiced harmonics, and thereby finds voiced sound and completes start-point detection.
In practical speech recognition systems, the signal-to-noise ratio of the input differs from one frequency band to another. If the noise energy near the fundamental frequency is strong, the pitch is hard to detect even though the harmonic characteristics of the signal are evident, so methods that detect voiced sound by detecting the pitch are easily disturbed. Likewise, if speech harmonics are detected over the full band, the bands with low signal-to-noise ratio also degrade voiced-sound detection. Depending on the signal, the invention therefore searches only for the 5 clearest harmonics, or only for the pitch and 3 adjacent harmonics. It thus automatically avoids the bands with low signal-to-noise ratio, works stably when the noise is strong, and is insensitive to changes in the noise.
Because the harmonics and the pitch concentrate most of the energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, pure voiced sound has uniformly spaced energy extrema in the frequency domain, with spacing equal to the fundamental frequency. Even when the voiced signal is disturbed by recording equipment and noise, it generally keeps at least 4 to 5 equally spaced energy extrema in the frequency domain. This is the main basis of harmonic detection in this embodiment. As shown in Fig. 3, step 300 of this embodiment (the harmonic detection step) comprises the following sub-steps:
Step 301: Begin examining the L frames of digital sound signal in the buffer, searching for signal with voiced-sound harmonic characteristics.
Step 302: Take from the buffer a frame that has not yet undergone harmonic detection (let it be the i-th frame in the buffer), pre-process it as the system requires (for example, windowing and pre-emphasis), zero-pad it to N points (where $N \ge F$, $N = 2^x$, x is an integer and $x \ge 8$), and then apply an N-point discrete Fourier transform to obtain the discrete spectrum $X(i,bin) = \sum_{n=0}^{N-1} x(i,n)\,e^{-j(2\pi/N)\,n\cdot bin}$, $bin = 0, 1, \ldots, N-1$, where $x(i,n)$ is the n-th sample of the i-th frame in the buffer and $X(i,bin)$ is the bin-th Fourier transform value of the i-th frame. Taking the squared magnitude gives the energy of each band, $\varepsilon_{bin} = |X(i,bin)|^2$, where bin is the band number, $1 \le bin \le N/2$, and the width of each band is R = sampling rate / N.
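Step 302 can be sketched with a direct evaluation of the DFT sum. This is for illustration only: a real system would use an FFT, pre-processing such as windowing is omitted, and the function names are hypothetical.

```python
import cmath

def band_energies(frame, N):
    """Zero-pad a frame to N points and return |X(bin)|^2 for bin = 0..N/2."""
    x = list(frame) + [0.0] * (N - len(frame))
    eps = []
    for b in range(N // 2 + 1):
        # Direct O(N^2) DFT sum, matching the formula in the text term by term
        X = sum(x[n] * cmath.exp(-2j * cmath.pi * n * b / N) for n in range(N))
        eps.append(abs(X) ** 2)
    return eps

def band_width(sample_rate, N):
    """Frequency resolution R = sampling rate / N, in Hz per band."""
    return sample_rate / N
```

For instance, at an assumed sampling rate of 8000 Hz with N = 256, each band is 31.25 Hz wide.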
Step 303: Denote the minimum band energy by $\varepsilon_{min}$. Among $\varepsilon_{bin}$ (where $1 < bin < N/2$), if $\varepsilon_{bin}$ simultaneously satisfies $\varepsilon_{bin} > \varepsilon_{bin-1}$, $\varepsilon_{bin} > \varepsilon_{bin+1}$, and $\varepsilon_{bin} > M\varepsilon_{min}$, mark $\varepsilon_{bin}$ as an energy extremum. It may correspond to a harmonic or the pitch, but it may also correspond to chance noise or to small fluctuations of the speech spectrum. M is an empirical constant, $5 < M < 20$. Suppose u bands meet this requirement; record their positions (that is, the subscript bin of $\varepsilon_{bin}$) in ascending frequency order as $I_k$ ($k = 1, \ldots, u$). The requirement $\varepsilon_{bin} > M\varepsilon_{min}$ is imposed because, when harmonics are present, the signal energy is unevenly distributed and $\varepsilon_{min}$ differs greatly from the energy of the harmonic bands; only an $\varepsilon_{bin}$ that satisfies this condition can correspond to a harmonic or the pitch.
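The extremum marking of step 303 can be sketched as a filtered scan over the band energies. The function name is hypothetical, and the sketch assumes the input list is indexed by bin from 0 to N/2, with the minimum taken over bins 1 to N/2.

```python
def energy_peaks(eps, M):
    """Return the bins I_k (ascending) that are local maxima exceeding M * eps_min.

    eps -- band energies indexed by bin, for bin = 0..N/2
    M   -- empirical constant, 5 < M < 20 in the text
    """
    half = len(eps) - 1      # N/2
    eps_min = min(eps[1:])   # minimum over 1 <= bin <= N/2
    return [b for b in range(2, half)  # 1 < bin < N/2
            if eps[b] > eps[b - 1] and eps[b] > eps[b + 1] and eps[b] > M * eps_min]
```

Raising M keeps only the most prominent peaks, at the cost of dropping weaker harmonics.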
Step 304: For each possible pitch, search for the extremum distribution closest to that pitch, that is, match the extrema to the pitch, in order to determine which voiced sound the current frame may correspond to. The human pitch range is 60 to 450 Hz. If $pitch_\alpha$ denotes the band number corresponding to the pitch, then $[60/R] \le pitch_\alpha \le [450/R]$, where [ ] denotes the largest integer less than the enclosed value and R is the width of each band, that is, the frequency-domain resolution (see step 302). Because the energy advantage of the pitch and harmonics is more pronounced at low frequencies, only the extrema in the 60 to 2000 Hz range are matched, corresponding to $[60/R] \le I_k \le [2000/R]$.
The search therefore proceeds as follows: $pitch_\alpha$ traverses all integers from [60/R] to [450/R], and for each value the energy extremum points numbered from [60/R] to [2000/R] are searched for points close to $pitch_\alpha$ and its harmonic positions. Voiced sound usually shows at least 4 to 5 uniformly distributed energy extremum points in the frequency domain. So if the existing extremum points include 5 points spaced by $pitch_\alpha$, whose band numbers $F_m$ satisfy $F_m \approx m \cdot pitch_\alpha$ with $m = 2, 3, \ldots, \left[\frac{2000}{R \cdot pitch_\alpha}\right]$, or include 4 points spaced by $pitch_\alpha$, whose band numbers $F_m$ satisfy $F_m \approx m \cdot pitch_\alpha$ with $m = 1, 2, 3, 4$, then this frame may correspond to voiced sound with pitch $pitch_\alpha$; otherwise this frame cannot correspond to that voiced sound.
To this end, for the currently selected trial pitch $pitch_\alpha$, search $I_k$ ($k = 1, \ldots, u$) for the value $I_{m'}$ closest to $m \cdot pitch_\alpha$; $I_{m'}$ satisfies $|I_{m'} - m \cdot pitch_\alpha| \le |I_{m''} - m \cdot pitch_\alpha|$ ($1 \le m' \le u$, $1 \le m'' \le u$, $m'' \ne m'$). All $I_{m'}$ corresponding to $m \cdot pitch_\alpha$ (where $m = 1, 2, \ldots, \left[\frac{2000}{R \cdot pitch_\alpha}\right]$) are recorded as the set $\{P_1, P_2, P_3, \ldots\}$, and an element $P_0 = 0$ is added to form the set $\{P_0, P_1, P_2, P_3, \ldots\}$.
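The nearest-extremum matching for one trial pitch can be sketched as below. The function name is hypothetical; `pitch` is assumed to be a positive integer band number, `max_bin` stands for $[2000/R]$, and `peaks` must be non-empty. When two peaks are equally close to a harmonic position, this sketch keeps the lower one, a tie-break the text does not specify.

```python
def match_harmonics(peaks, pitch, max_bin):
    """Return [P_0, P_1, ...]: P_0 = 0, then for each m the peak nearest m * pitch.

    peaks   -- ascending peak positions I_k (band numbers)
    pitch   -- trial pitch in bands, a positive integer
    max_bin -- highest band considered, i.e. [2000 / R]
    """
    matched = [0]  # P_0 = 0
    for m in range(1, max_bin // pitch + 1):
        target = m * pitch
        matched.append(min(peaks, key=lambda i: abs(i - target)))
    return matched
```

Running this for every integer pitch in $[60/R]$ to $[450/R]$ gives one candidate set per trial pitch for the spacing test that follows.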
Step 305: if the set puppet, { P 0, P 1, P 2, P 3... there are 5 continuous element P among the .} t, P T+1, P T+2, P T+3, P T+4(wherein t = 0,1 , . . [ 2000 R · pitch α - 4 ] ) spacing equates, thinks that then The extreme value distribution and fundamental tone are complementary.Because frequency span is R, promptly frequency domain resolution can not be accomplished definitely accurately, so the exact value of fundamental tone can be expressed as the spacing average of these 5 elements.
To assess whether the spacings of 5 elements are equal, for any 5 consecutive elements of the set $\{P_0, P_1, P_2, P_3, \ldots\}$ compute the spacings $D_1 = P_{t+1} - P_t$, $D_2 = P_{t+2} - P_{t+1}$, $D_3 = P_{t+3} - P_{t+2}$, $D_4 = P_{t+4} - P_{t+3}$ (where $t = 0, 1, \ldots, \left[\frac{2000}{R \cdot pitch_\alpha}\right] - 4$). Compute the mean $\bar{D}$ of $\{D_1, D_2, D_3, D_4\}$ and normalize their variance, expressed as $Var = \frac{\sum_{q=1}^{4} (D_q - \bar{D})^2}{4 \bar{D}^2}$. The smaller $Var$ is, the more likely the frame contains a voiced sound. For the same $pitch_\alpha$ there are several 5-element combinations; select the group with the smallest normalized variance and record its mean spacing as $D_\alpha$ and its normalized variance as $Var_\alpha$.
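A small sketch of the spacing test in step 305, under the assumption that the matched set is available as a Python list `P` with `P[0] == 0`. The function names are illustrative, and the zero-mean guard is an addition not discussed in the source.

```python
def normalized_variance(window):
    """Return (mean spacing, Var) for 5 consecutive matched band indices."""
    d = [window[q + 1] - window[q] for q in range(4)]
    mean = sum(d) / 4
    if mean == 0:
        # assumption: identical elements cannot represent a harmonic series
        return mean, float("inf")
    var = sum((x - mean) ** 2 for x in d) / (4 * mean ** 2)
    return mean, var


def best_window(P):
    """Pick the 5-element window with the smallest normalized variance.

    Returns (D_alpha, Var_alpha), or None if P has fewer than 5 elements.
    """
    best = None
    for t in range(len(P) - 4):
        mean, var = normalized_variance(P[t:t + 5])
        if best is None or var < best[1]:
            best = (mean, var)
    return best


uniform = normalized_variance([0, 5, 10, 15, 20])
d_alpha, var_alpha = best_window([0, 4, 8, 12, 17, 20])
```

For the perfectly uniform window the normalized variance is 0. For the second list the window starting at $P_0$ wins, with a mean spacing of 4.25 bands, which illustrates how the mean spacing refines the integer candidate pitch of 4.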
Step 306: If $Var_\alpha < Var_{Thd}$ (where $Var_{Thd}$ is an empirical value, $0.001 < Var_{Thd} < 0.003$), the current frame is considered to possibly contain a voiced sound with pitch $D_\alpha$, where $D_\alpha$ represents the precise fundamental frequency corresponding to this group of spacings, and step 307 is entered. If, in the set $\{P_1, P_2, P_3, \ldots\}$, the spacings $\{D_1, D_2, D_3, D_4\}$ of no 5-element combination satisfy $Var < Var_{Thd}$, step 310 is entered.
Step 307: Determine whether $D_\alpha$ is very close to another precise fundamental frequency already saved for the current frame. The process is as follows:
While $pitch_\alpha$ traverses all integers from $[60/R]$ to $[450/R]$, suppose an earlier pitch $pitch_\beta$ also corresponded to a uniform extremum distribution, i.e. the $Var_\alpha$ and $Var_\beta$ corresponding to $pitch_\alpha$ and $pitch_\beta$ are both smaller than $Var_{Thd}$ in step 306; then $Var_\beta$ and $D_\beta$ have both been saved. Check whether $|D_\alpha - D_\beta|$ is greater than the threshold $D_{Pitch}$, where $D_{Pitch} = D_{Min}/R$ and $D_{Min}$ is an empirical value ($50 < D_{Min} < 150$) in Hz.
If $|D_\alpha - D_\beta| \ge D_{Pitch}$, enter step 308.
If $|D_\alpha - D_\beta| < D_{Pitch}$, enter step 309.
Step 308: This step is entered when the normalized spacing variance of the current precise fundamental frequency $D_\alpha$ is small enough and $D_\alpha$ is far enough from all other precise fundamental frequencies. Record the current spacing mean $D_\alpha$ and normalized variance $Var_\alpha$. Among all $pitch_\alpha$ in $[60/R]$ to $[450/R]$, if $\alpha = \alpha_1, \alpha_2, \alpha_3, \ldots$ all correspond to uniformly distributed extrema, keep their normalized variances and precise fundamental frequencies, denoted $VAR = \{Var_1, Var_2, \ldots\}$ and $D_{All} = \{D_1, D_2, \ldots\}$. The reason is that when the input signal is a voiced sound with a very low fundamental frequency, the second or even third multiple of the pitch also lies within the human pitch range, i.e. within the search range of $pitch_\alpha$, and its extremum spacing may be even more uniform. The data with the smallest normalized variance therefore does not necessarily correspond to the true pitch. Common voiced-sound detection algorithms often locate voiced sounds by tracking the fundamental frequency, and so are easily disturbed by pitch multiples and fail. This method keeps several groups of data and screens them at the end using the neighboring frames, which avoids the pitch-multiple interference problem.
Step 309: This step is entered when $|D_\alpha - D_\beta| < D_{Pitch}$ in step 307. Of two precise fundamental frequencies that are close in frequency, only the one with the smaller normalized variance is kept. That is, if $Var_\alpha < Var_\beta$, delete the records of $D_\beta$ and $Var_\beta$ from the sets $VAR$ and $D_{All}$ and record $Var_\alpha$ and $D_\alpha$ there; otherwise keep the records of $D_\beta$ and $Var_\beta$ and do not record the information related to $D_\alpha$. The reason is that if two very close pitches both appear possible, this is probably because $pitch_\alpha$ is limited by the spectral resolution and is imprecise; selecting the group with the more uniform extremum distribution keeps the pitch closest to the actual one.
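The bookkeeping of steps 306 to 309 can be sketched as a single update function. This is an assumed simplification: candidates are stored as `(D, Var)` pairs in one list instead of the two sets $VAR$ and $D_{All}$, only the first close neighbor is compared, and the default values of `var_thd` and `d_pitch` are illustrative picks from within the ranges given in the text (for example $D_{Min} = 100$ Hz with $R \approx 31$ Hz gives $D_{Pitch} \approx 3$ bands).

```python
def update_candidates(candidates, d_alpha, var_alpha, var_thd=0.002, d_pitch=3.0):
    """Return the candidate list of the current frame after testing one pitch.

    candidates: list of (D, Var) pairs already kept for this frame.
    """
    if var_alpha >= var_thd:
        # step 306 fails: spacing not uniform enough, nothing is recorded
        return candidates
    for i, (d_beta, var_beta) in enumerate(candidates):
        if abs(d_alpha - d_beta) < d_pitch:
            # step 309: two close pitches, keep the more uniform one
            if var_alpha < var_beta:
                return candidates[:i] + [(d_alpha, var_alpha)] + candidates[i + 1:]
            return candidates
    # step 308: far from every saved pitch, keep it as an additional candidate
    return candidates + [(d_alpha, var_alpha)]


kept = []
kept = update_candidates(kept, 4.0, 0.001)    # first candidate is stored
kept = update_candidates(kept, 8.0, 0.0005)   # far away (possible octave), also stored
kept = update_candidates(kept, 4.5, 0.0002)   # close to 4.0 and more uniform, replaces it
kept = update_candidates(kept, 4.2, 0.0015)   # close to 4.5 but less uniform, ignored
kept = update_candidates(kept, 12.0, 0.01)    # variance too large, ignored
```

After these calls `kept` holds two candidates, 4.5 and 8.0. Keeping both reflects the text's point that a pitch and its multiple may coexist until the cross-frame screening resolves them.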
Step 310: Determine whether all possible pitches have been traversed (i.e. all integers in the range $[60/R] \le pitch_\alpha \le [450/R]$). If yes, enter step 311; if no, return to step 304.
Step 311: Determine whether all frames in the buffer have been processed. After each frame has been tested against all pitches in $[60/R] \le pitch_\alpha \le [450/R]$ (i.e. for each possible pitch, the best-matching extrema have been searched among all current extrema), what finally remains is $VAR$ and $D_{All}$. For some frames, no pitch yields an extremum distribution that meets the requirements, so their $VAR$ and $D_{All}$ may both be empty sets. If the buffer still contains signal that has not gone through the extremum search and candidate-pitch tests of steps 302 to 310, return to step 302; if all frames in the buffer have been searched and tested, enter step 312.
Step 312: Begin combining the information of several neighboring frames to make a preliminary decision on whether harmonics exist. First set the number $t$ of the first frame to be examined to 1.
Step 313: Check whether frames $t$ to $t+3$ can be connected into a harmonic, i.e. decide whether a voiced sound exists. The decision condition is: among 4 consecutive frames, if the $D_{All}$ of every pair of adjacent frames each contain at least one element that is very close to an element of the other, the 4 frames are considered to contain a harmonic. Specifically, two adjacent frames meet the requirement as long as the $D_{All}$ of each contains at least one element whose difference from some element of the other's $D_{All}$, divided by the smaller of the two values, does not exceed a constant $ratio$. The value of $ratio$ ranges from 10% to 20%. Fig. 4 is a simplified schematic of the connection. As it shows, among 4 frames (frames $t$, $t+1$, $t+2$ and $t+3$), as long as every two consecutive frames are connected by at least one path, the 4 frames can be judged as voiced.
Connection between adjacent frames is required because the fundamental frequency and harmonics of a voiced sound change slowly and continuously, with no abrupt change between two frames. Thus, even if some signal segment shows uniformly distributed energy extrema because of interference, its frames cannot be connected continuously and it will not be judged as voiced. In addition, the pitch multiples of some voiced sounds also yield very small normalized variances, so the $D_{All}$ of each frame may contain both the fundamental frequency and its multiples. Therefore, as long as either the fundamental frequency or a multiple can be connected, i.e. there is one connecting path between two successive frames, the fundamental frequency and harmonics of the two frames can be considered continuous, which prevents true voiced sounds from being missed.
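A sketch of the connectivity test of steps 312 to 315, assuming each frame's $D_{All}$ is a Python list of positive pitch values (possibly empty) and using 0-based frame numbers instead of the 1-based numbering in the text. The names and the default `ratio` of 0.15 (inside the stated 10% to 20% range) are assumptions.

```python
def frames_connected(d_a, d_b, ratio=0.15):
    """True if some pitch in d_a and some pitch in d_b differ by at most
    ratio times the smaller of the two values."""
    return any(abs(x - y) <= ratio * min(x, y) for x in d_a for y in d_b)


def voiced_frames(d_all, ratio=0.15):
    """Mark frames that belong to any run of 4 consecutive connected frames.

    d_all: list with one list of candidate pitches per buffered frame.
    """
    n = len(d_all)
    voiced = [False] * n
    for t in range(n - 3):
        # frames t..t+3 need a connecting path between each adjacent pair
        if all(frames_connected(d_all[k], d_all[k + 1], ratio) for k in range(t, t + 3)):
            for k in range(t, t + 4):
                voiced[k] = True
    return voiced


d_all = [[4.0], [4.2, 8.4], [8.6, 4.3], [8.7], [], [4.0]]
flags = voiced_frames(d_all)
```

Here the first four frames are marked as voiced even though the connecting path switches from the fundamental (about 4.2) to its multiple (about 8.6), which is the situation the paragraph above describes. The frame with an empty $D_{All}$ breaks every later 4-frame run.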
Step 314: Increase the starting frame number $t$ for the harmonic decision by 1, and continue by checking whether the next 4 frames are connected.
Step 315: Determine whether the harmonic connection test has been done for all groups of 4 consecutive frames in the buffer. That is, if the 4 consecutive frames from frame $L-3$ to frame $L$ have been tested, enter step 316; otherwise return to step 313.
Step 316: Because voiced sounds change slowly and continuously, and the main purpose of an endpoint detection algorithm is to detect speech (i.e. both unvoiced and voiced sounds), when two voiced segments are close to each other the segment between them can also be considered speech, and is very likely voiced. Therefore, the neighboring information is used to further connect and shape the harmonics already detected in the buffer: any non-harmonic segment in the buffer that lies between two voiced segments and is shorter than 60 milliseconds is also judged as voiced.
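The shaping in step 316 amounts to filling short gaps between voiced runs. A minimal sketch follows, assuming per-frame boolean flags and a known frame step `frame_ms` in milliseconds; both are assumptions, since the source does not state the frame length at this point.

```python
def fill_short_gaps(voiced, frame_ms, max_gap_ms=60):
    """Relabel as voiced any unvoiced run that is enclosed by voiced frames
    on both sides and lasts less than max_gap_ms."""
    out = list(voiced)
    n = len(out)
    i = 0
    while i < n:
        if voiced[i]:
            i += 1
            continue
        j = i
        while j < n and not voiced[j]:
            j += 1
        # the gap covers frames i..j-1; runs touching the buffer edges are left alone
        if i > 0 and j < n and (j - i) * frame_ms < max_gap_ms:
            for k in range(i, j):
                out[k] = True
        i = j
    return out


flags = [False, True, True, False, False, True] + [False] * 6 + [True, False]
shaped = fill_short_gaps(flags, frame_ms=10)
```

With an assumed 10 ms frame step, the 20 ms gap between the first two voiced runs is filled, while the 60 ms gap is kept because the text requires the gap to be shorter than 60 milliseconds.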
Step 317: Output the voiced-sound detection result for the current buffer. The harmonic detection process ends.

Claims (4)

1. A voice endpoint detection method based on energy and harmonics, characterized in that it comprises the following steps:
1) dividing the input digitized sound signal into frames and applying windowing and pre-emphasis, each frame containing F sampling points, where F is an integer;
2) using a buffer to hold the most recent L frames of the signal, where L is an integer greater than 25 and the time length of the L frames is greater than 200 milliseconds;
3) performing a preliminary detection of the voice start point from the energy of the signal in the buffer, comprising:
31) estimating the noise energy from the first $r$ frames in the buffer, where $7 < r < 13$ and $r$ is an integer; with $E_i$ denoting the energy of the $i$-th frame, computing the mean energy of the first $r$ frames $\bar{E} = \frac{1}{r} \sum_{i=1}^{r} E_i$ and the standard deviation $\sigma = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E})^2 \right)^{1/2}$, and normalizing the standard deviation, $d = \sigma / \bar{E}$; the mean energy (written $E$ below) and $d$ serve as the initial estimate of the noise distribution;
32) setting the energy threshold $Thd$ according to the noise energy, computed as $Thd = E + \sigma/\alpha = E + E \cdot d/\alpha$, where $\alpha$ is a sensitivity coefficient, $0 < \alpha < 1$;
33) after the energy threshold $Thd$ is determined, performing energy detection; if the buffer contains a continuous signal of $t_{Start\_energy}$ milliseconds whose mean energy is greater than $Thd$, marking the first frame of this continuous signal as $T_{Start}$ and triggering harmonic detection, where $t_{Start\_energy}$ is a constant in milliseconds with range $80 < t_{Start\_energy} < 150$; if no signal segment in the buffer satisfies this condition, inputting one new frame, updating the buffer and the threshold $Thd$, and searching for the voice start point in the updated buffer;
4) after energy detection has preliminarily determined the voice start point $T_{Start}$, searching the current buffer for a signal with voiced-sound harmonic characteristics; if a voiced sound is found, considering that speech exists in the buffer and entering step 5) to search for the voice start point more precisely; if no voiced sound is found, deleting the $T_{Start}$ mark, updating the buffer and the energy threshold $Thd$, and returning to step 3);
5) precisely searching for the voice start point in the current buffer, comprising: searching forward from $T_{Start}$, frame by frame, for a frame that satisfies the following condition: starting from this frame, the signal energy increases frame by frame; the frame preceding this frame is then judged to be the voice start point;
after step 5) is completed, detecting the voice end point according to the energy of the signal in the buffer, the detection process being: if the buffer contains a continuous signal of length $t_{End\_energy}$ milliseconds, where $t_{End\_energy}$ is a constant in milliseconds with range $60 < t_{End\_energy} < 120$, in which the energy of every frame is greater than the end-point energy threshold $Thd_{End}$, updating the buffer; continuing to update the buffer and the threshold $Thd_{End}$ until no such signal segment can be found in the buffer, whereupon the first frame in the buffer is judged to be the voice end point.
2. The voice endpoint detection method based on energy and harmonics according to claim 1, characterized in that, in said step 3) and step 4), the process of updating the energy threshold comprises: computing the mean energy of the first $r$ frames in the buffer $\bar{E}_{new} = \frac{1}{r} \sum_{i=1}^{r} E_i$ and the normalized standard deviation $d_{new} = \sigma_{new} / \bar{E}_{new} = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E}_{new})^2 \right)^{1/2} / \bar{E}_{new}$, and updating $E$ and $d$ with the following logic (where $E_{new}$ denotes $\bar{E}_{new}$):
A) if $E_{new} < E \cdot c_{Update}$ and $d_{new} < d \cdot c_{Update}$, then $E = E_{new}$ and $d = d_{new}$, where $c_{Update}$ is the noise update constant, $1 < c_{Update} < 1.5$; this strategy ensures that the system tracks stationary noise at all times;
B) if condition A) is not satisfied, but $E_{new} < E \cdot 10$ and $d_{new} < d$, then $E = E_{new}$ and $d = d_{new}$; this update strategy helps adapt quickly to noise that changes in steps;
C) if conditions A) and B) are not satisfied, but $E_{new} < E \cdot 10$, $d_{new} < 1$ and $d_{new} < d \cdot 1.5$, then $E = (E_{new} + E)/2$ and $d = (d_{new} + d)/2$; this strategy prevents the common noise-update "deadlock" situation and helps track step noise whose energy rises suddenly;
D) if none of the above conditions is satisfied, the $r$ frames are not stationary noise, so no noise update is performed;
after $E$ and $d$ have been updated, the new threshold is computed with the formula $Thd = E + E \cdot d/\alpha$.
3. The voice endpoint detection method based on energy and harmonics according to claim 1, characterized in that said end-point energy threshold $Thd_{End}$ is determined by the noise energy mean $E$ and the normalized standard deviation $d$, expressed as $Thd_{End} = E + E \cdot d$; each time the buffer is updated, $E$ and $d$ are likewise updated using the mean energy of the first $r$ frames $\bar{E}_{new} = \frac{1}{r} \sum_{i=1}^{r} E_i$ and the normalized standard deviation $d_{new} = \sigma_{new} / \bar{E}_{new} = \left( \frac{1}{r} \sum_{i=1}^{r} (E_i - \bar{E}_{new})^2 \right)^{1/2} / \bar{E}_{new}$ respectively; the update logic is: if $E_{new} \le E$ and $d_{new} \le d$, then $E = (E_{new} + E)/2$ and $d = (d + d_{new})/2$.
4. The voice endpoint detection method based on energy and harmonics according to claim 1, characterized in that, in said step 33), the buffer is updated once for every new input frame and the threshold $Thd$ is updated accordingly, until $T_{Start}$ is found; the process of updating the buffer is: discarding the content of frame 1 in the buffer, shifting the content of frames 2 to L forward by one frame, and storing the new frame in frame L of the buffer.
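The noise estimate of step 31), the threshold of step 32) and the update logic of claim 2 can be sketched as follows. This is an illustrative reading, not a definitive implementation: the function names and the default values of `c_update` and `alpha` are assumptions chosen inside the claimed ranges, and the garbled expressions "E10" and "d1.5" are read as $E \cdot 10$ and $d \cdot 1.5$.

```python
import math


def noise_stats(energies):
    """Mean energy and normalized standard deviation d of r positive frame energies."""
    r = len(energies)
    mean = sum(energies) / r
    sigma = math.sqrt(sum((e - mean) ** 2 for e in energies) / r)
    return mean, sigma / mean


def update_noise(E, d, E_new, d_new, c_update=1.2):
    """Apply update rules A) to D) and return the new (E, d)."""
    if E_new < E * c_update and d_new < d * c_update:
        return E_new, d_new                        # A) track stationary noise
    if E_new < E * 10 and d_new < d:
        return E_new, d_new                        # B) adapt to a step change
    if E_new < E * 10 and d_new < 1 and d_new < d * 1.5:
        return (E_new + E) / 2, (d_new + d) / 2    # C) avoid update deadlock
    return E, d                                    # D) not stationary noise


def start_threshold(E, d, alpha=0.5):
    """Thd = E + E * d / alpha, with sensitivity coefficient 0 < alpha < 1."""
    return E + E * d / alpha


E, d = noise_stats([2.0, 4.0])
case_a = update_noise(1.0, 0.2, 1.1, 0.22)
case_b = update_noise(1.0, 0.2, 5.0, 0.1)
case_c = update_noise(1.0, 0.2, 5.0, 0.25)
case_d = update_noise(1.0, 0.2, 20.0, 0.1)
thd = start_threshold(1.0, 0.2)
```

A smaller `alpha` raises the threshold above the noise floor, so the detector becomes less sensitive. The four calls to `update_noise` each exercise one of the rules A) to D) in order.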
CN200510089957A 2005-08-08 2005-08-08 Voice end detection method based on energy and harmonic Expired - Fee Related CN100580770C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200510089957A CN100580770C (en) 2005-08-08 2005-08-08 Voice end detection method based on energy and harmonic

Publications (2)

Publication Number Publication Date
CN1912993A CN1912993A (en) 2007-02-14
CN100580770C true CN100580770C (en) 2010-01-13

Family

ID=37721904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200510089957A Expired - Fee Related CN100580770C (en) 2005-08-08 2005-08-08 Voice end detection method based on energy and harmonic

Country Status (1)

Country Link
CN (1) CN100580770C (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020118648A1 (en) * 2001-02-15 2002-08-29 Olaf Zaencker Method and arrangement for testing the transmission system and method for quality of a speech transmission
EP1284551A1 (en) * 2001-08-08 2003-02-19 Siemens Aktiengesellschaft Assignment of a service quality for information transfer within a communication network
CN1427395A (en) * 2001-12-17 2003-07-02 中国科学院自动化研究所 Speech sound signal terminal point detecting method based on sub belt energy and characteristic detecting technique
CN1628337A (en) * 2002-06-12 2005-06-15 三菱电机株式会社 Speech recognizing method and device thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100113