CN1912992A - Voiced sound detection method based on harmonic characteristic - Google Patents

Voiced sound detection method based on harmonic characteristic

Info

Publication number
CN1912992A
CN1912992A (application numbers CNA2005100899564A, CN200510089956A)
Authority
CN
China
Prior art keywords: frame, signal, voiced sound, pitch, frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005100899564A
Other languages
Chinese (zh)
Other versions
CN100580768C (en)
Inventor
国雁萌
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and Beijing Kexin Technology Co Ltd
Priority to CN200510089956A
Publication of CN1912992A
Application granted
Publication of CN100580768C
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract

This invention relates to a method for detecting voiced sound in a noisy environment. Each frame of the signal is pre-processed and transformed with an N-point discrete Fourier transform; the energy of each frequency band of the frame is computed and its extrema are located. Based on the harmonic character of human voiced sound, the band-energy extrema are matched against all possible pitches in the human pitch range to find the pitches that may be present. An overall voiced decision is then made over a number of frames based on speech characteristics, after which the buffer is shifted and updated and voiced detection continues on the updated buffer. Advantage: performance remains stable even at low signal-to-noise ratio or under changing noise, and the positions of voiced sound can be detected accurately even when the pitch is low.

Description

A voiced sound detection method based on harmonic characteristics
Technical field
The present invention relates to the field of speech signal processing, and in particular to a method of detecting voiced sound in noisy environments.
Background
In practical speech signal processing systems, the position of voiced sound often needs to be detected. For example, in Voice Activity Detection systems, detecting voiced sound gives a rough indication of whether speech is present; in speech segmentation systems, locating voiced segments helps divide the signal into sections.
Voiced sound is periodic in the time domain, with the pitch as its period, and has a harmonic structure in the frequency domain. Common voiced sound detection methods therefore mainly use the pitch feature of voiced sound and the total intensity of the harmonics over all bands to locate voiced segments. When the speech signal is corrupted by noise, however, the performance of these methods degrades. For example, in 1979 Irwin proposed a way of assessing how strongly periodic a signal is, called least-squares periodicity estimation, and in 1992 R. Tucker used it for speech endpoint detection; but the method is easily disturbed by low-frequency noise. In 1980 Yutaka Kobayashi proposed locating voiced sound by searching for extrema in the cepstrum of the signal; this method likewise cannot be used at low signal-to-noise ratio and copes poorly with low-frequency noise. In 2001, Tong Zhang, detecting speech and music for audio-visual signal segmentation, used a method called short-time fundamental frequency (SFuF). SFuF estimates an autoregressive model from the time-domain autocorrelation, forms a smoothed spectrum, searches the spectrum for its peaks to obtain the fundamental frequency, and locates voiced sound from it. But this method is easily disturbed by noise, and can fail by mistaking a multiple of the fundamental frequency for the fundamental itself. In 2003, Ahmad R. Abu-El-Quran, classifying audio signals, proposed an algorithm for detecting speech in a signal: each frame is first low-pass filtered and center-clipped and its autocorrelation computed; extrema are then searched in the autocorrelation function to decide whether the frame contains harmonics; finally, the proportion of harmonic frames in a segment decides whether the segment contains speech. This method, too, is easily disturbed by pitch multiples, and is hard to apply when the noise is strong or changeable.
In practical speech signal processing systems, the signal-to-noise ratio of the input differs from band to band. When the noise energy near the pitch band is strong, the pitch is hard to detect even though the harmonic structure is clear, so methods that locate voiced sound by detecting the pitch are easily disturbed. Likewise, if the harmonics of voiced sound are detected over the full band, the low-SNR bands also degrade the detection. The present algorithm therefore searches only for the 4-5 clearest harmonics, automatically avoiding the low-SNR bands; it works stably even under strong noise and is insensitive to changes in the noise.
Because the harmonics and the pitch concentrate most of the energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, clean voiced sound shows uniformly distributed energy extrema in the frequency domain, with spacing equal to the fundamental frequency. Even when the voiced signal is disturbed by the recording equipment and by noise, 4-5 equally spaced energy extrema usually remain in the frequency domain; this is the main basis on which the present invention detects voiced sound from its harmonic characteristics.
Summary of the invention
The object of the invention is to detect the position of voiced sound in noisy speech signals.
To achieve this object, the voiced sound detection method based on harmonic characteristics provided by the invention comprises the following steps:
1) pre-process the input digitized sound signal (framing, windowing, pre-emphasis), and set up a buffer to hold L frames of signal (200-300 milliseconds in total) together with their intermediate results; store the first L frames of the input in the buffer;
2) let the frame length be F; for each frame in the buffer, first zero-pad to N points (where N >= F, N = 2^x, x an integer, x >= 8), then perform an N-point discrete Fourier transform; compute the energy of each frequency band from the Fourier transform values, and use these energies to search for the frequency-domain energy extrema of the frame;
3) for each frame in the buffer, based on the frequency-domain energy characteristics of human voiced sound, match the energy extreme points found in step 2) against all possible pitches in the human fundamental frequency range to find the pitches that may be present in the frame;
4) for each frame in the buffer, among all the possible pitches found in step 3), merge those whose frequencies differ only slightly, and collect the remaining pitches into a set;
5) according to the continuous, gradual variation of the pitch and the harmonics of voiced sound, combine the information of neighboring frames in the buffer to make an overall voiced decision over a number of frames;
6) if there is no new digitized input, output the voiced detection results of all frames in the buffer and end the detection; otherwise, output the voiced decision of the 1st frame, discard the contents of the first buffer frame, shift buffer frames 2 to L and their corresponding results forward by one frame each, read a new frame into buffer position L, pre-process it, return to step 2), and continue harmonic detection on the signal in the updated buffer.
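Taken together, steps 1) and 6) describe a sliding L-frame buffer that decides each frame using look-ahead context. The sketch below illustrates only that flow; the frame size F, buffer length L, and the placeholder per-frame decision (frame energy standing in for steps 2)-5)) are illustrative assumptions, not values fixed by the invention.

```python
import numpy as np

# Assumed values for illustration only
F, L = 256, 10

def decide(buf):
    # placeholder decision: energy of the oldest frame in the buffer
    return float(np.sum(buf[0] ** 2))

def process_stream(frames):
    """Yield one result per frame, deciding each frame with an L-frame buffer."""
    buf = []
    results = []
    for frame in frames:
        buf.append(np.asarray(frame, dtype=float))
        if len(buf) == L:
            results.append(decide(buf))   # decide the oldest frame in context
            buf.pop(0)                    # step 6): shift the buffer forward
    while buf:                            # input ended: flush remaining frames
        results.append(decide(buf))
        buf.pop(0)
    return results
```

Note that the buffer is only ever shifted by one frame at a time, so every frame is decided exactly once, with up to L-1 frames of look-ahead.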
In the above scheme, step 2) comprises the following substeps:
21) let the frame length be F; take one frame from the buffer, zero-pad it to N points (where N >= F, N = 2^x, x an integer, x >= 8), and perform an N-point discrete Fourier transform; compute the energies of the N/2 frequency bands of the frame, denoted ε_1, ε_2, ..., ε_{N/2} in order of increasing frequency, and denote the minimum band energy ε_min;
22) for a band number bin with 1 < bin < N/2, if the band energy ε_bin simultaneously satisfies ε_bin > ε_{bin-1}, ε_bin > ε_{bin+1}, and ε_bin > M·ε_min, mark ε_bin as an energy extremum, where M is an empirical constant, 5 < M < 20;
23) let u be the number of energy extrema found; in order of increasing frequency, record the subscript positions (i.e. the values of bin) of the ε_bin satisfying the extremum condition of step 22), denoted I_k (k = 1..u).
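As a concrete illustration of substeps 21)-23), the sketch below computes the N/2 band energies with a zero-padded FFT and marks the interior local maxima that also exceed M·ε_min. N = 1024 and M = 10 are assumed values within the stated ranges, not prescribed by the text.

```python
import numpy as np

def energy_extrema(frame, N=1024, M=10.0):
    """Return band energies eps_1..eps_{N/2} and extremum positions I_k."""
    X = np.fft.fft(frame, n=N)             # n=N zero-pads the frame to N points
    eps = np.abs(X[: N // 2]) ** 2         # energies of the N/2 bands
    eps_min = eps.min()
    # substep 22): interior local maxima that are at least M times eps_min
    idx = [b for b in range(1, N // 2 - 1)
           if eps[b] > eps[b - 1] and eps[b] > eps[b + 1]
           and eps[b] > M * eps_min]
    return eps, np.array(idx)              # I_k in order of increasing frequency
```

For a voiced-like frame (a sum of harmonics), the returned positions cluster near integer multiples of the pitch band, i.e. near multiples of F0/R.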
In the above scheme, step 3) comprises the following substeps:
31) according to the human pitch range (60-450 Hz), traverse all possible fundamental frequencies pitch_α, where [60/R] <= pitch_α <= [450/R], [·] denotes the largest integer not exceeding the value, and R = sampling rate / N is the width of each frequency band;
32) for the currently tested fundamental frequency pitch_α: if among the existing extreme points there are 5 points spaced by pitch_α, or 4 points spaced by pitch_α one of which lies at pitch_α itself, tentatively decide that the current frame may be voiced with fundamental frequency pitch_α. Accordingly, if the normalized variance of the spacings of the most uniform 4 or 5 extreme points found at spacing pitch_α is below a threshold Var_thd, pitch_α is considered possibly present, and the mean spacing of those most uniform extreme points is taken as the accurate fundamental frequency;
33) after traversing all possible fundamental frequencies, record each accurate fundamental frequency that may be present in the current frame, together with its normalized variance.
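Substeps 31)-33) can be sketched as follows for a single candidate pitch. The sketch assumes extrema are given as band numbers, picks the extremum nearest each harmonic multiple, and applies the normalized-variance test of substep 32); Var_thd = 0.002 is an assumed value within the stated range 0.001 < Var_thd < 0.003.

```python
import numpy as np

def match_pitch(extrema, pitch_a, max_band, var_thd=0.002):
    """For candidate pitch band pitch_a, pick the extremum nearest each
    multiple m*pitch_a, then test the most uniform 5 consecutive picks.
    Returns (accurate_f0_in_bands, normalized_variance) or None."""
    ext = np.asarray(extrema, dtype=float)
    m_max = int(max_band // pitch_a)
    if m_max < 4 or len(ext) == 0:
        return None
    # nearest extremum to each harmonic position, plus the added P_0 = 0
    P = [0.0] + [ext[np.argmin(np.abs(ext - m * pitch_a))]
                 for m in range(1, m_max + 1)]
    best = None
    for t in range(len(P) - 4):            # every group of 5 consecutive elements
        D = np.diff(P[t:t + 5])            # the four spacings D_1..D_4
        Dbar = D.mean()
        if Dbar <= 0:
            continue
        var = np.sum((D - Dbar) ** 2) / (4 * Dbar ** 2)
        if best is None or var < best[1]:
            best = (Dbar, var)
    if best is not None and best[1] < var_thd:
        return best
    return None
```

With perfectly spaced extrema the variance is zero and the mean spacing equals the candidate pitch; with scattered extrema no group passes the threshold and the candidate is rejected.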
In the above scheme, step 4) comprises the following substeps:
41) when two accurate fundamental frequencies satisfying the condition of step 32) differ by less than D_pitch, keep only the one with the smaller normalized variance, together with that variance; here D_pitch = D_min / R, where D_min is an empirical value in Hz, 50 < D_min < 150;
42) among all pitch_α ([60/R] <= pitch_α <= [450/R]), if α = α_1, α_2, α_3, ... all correspond to uniformly distributed extrema, and the accurate fundamental frequencies are mutually separated as required by step 41), record the corresponding accurate fundamental frequencies as a set.
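Substep 41) amounts to a one-pass merge over the candidates sorted by frequency. The sketch below assumes candidates are given as (accurate fundamental in bands, normalized variance) pairs and that D_pitch has already been converted from Hz to bands.

```python
def merge_close_pitches(candidates, d_pitch):
    """Substep 41): among candidates closer together than d_pitch bands,
    keep only the one with the smaller normalized variance."""
    kept = []
    for f0, var in sorted(candidates):           # low to high frequency
        if kept and abs(f0 - kept[-1][0]) < d_pitch:
            if var < kept[-1][1]:
                kept[-1] = (f0, var)             # replace with the better one
        else:
            kept.append((f0, var))
    return kept
```

The surviving fundamentals form the per-frame set used in step 5).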
In the above scheme, step 5) comprises the following substeps:
51) among any 4 consecutive frames in the buffer, if for every pair of adjacent frames the sets of accurate fundamental frequencies each contain at least one element very close to an element of the other frame's set, the 4 frames are considered to contain voiced sound. Two elements are considered close when the ratio of their difference to the smaller of the two does not exceed a constant ratio, where 10% < ratio < 20%;
52) after the preliminary decision of step 51), any non-voiced segment in the buffer that lies between two voiced segments and is shorter than 20 milliseconds is also judged to be voiced.
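Substeps 51) and 52) can be sketched as follows. Per-frame pitch sets are lists of accurate fundamentals; ratio = 0.15 is an assumed value in the stated 10%-20% range, and the 20 ms gap is expressed here as a frame count max_gap, an assumption about the frame rate.

```python
def close(a, b, ratio=0.15):
    """Substep 51): two elements are close if |a-b| / min(a,b) <= ratio."""
    return abs(a - b) / min(a, b) <= ratio

def voiced_frames(pitch_sets, ratio=0.15, max_gap=2):
    """Mark frames voiced when 4 consecutive frames chain pairwise-close
    pitches (substep 51), then fill short unvoiced gaps between voiced
    segments (substep 52, max_gap frames standing in for 20 ms)."""
    L = len(pitch_sets)
    voiced = [False] * L
    for t in range(L - 3):
        if all(any(close(a, b, ratio)
                   for a in pitch_sets[i] for b in pitch_sets[i + 1])
               for i in range(t, t + 3)):
            for i in range(t, t + 4):
                voiced[i] = True
    i = 0
    while i < L:                       # fill short gaps between voiced segments
        if not voiced[i]:
            j = i
            while j < L and not voiced[j]:
                j += 1
            if 0 < i and j < L and (j - i) < max_gap:
                for k in range(i, j):
                    voiced[k] = True
            i = j
        else:
            i += 1
    return voiced
```

A frame with an empty pitch set breaks the chain, but if such a break is short and sits between two voiced runs, the gap-filling pass restores it.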
In the above scheme, step 6) comprises the following substeps:
61) if there is no new digitized sound signal input, output the voiced detection results of all frames in the buffer and end the detection;
62) if new digitized input remains, output the voiced decision of the 1st frame, discard the contents of the first buffer frame, shift buffer frames 2 to L forward by one frame each together with their per-frame results, store a new frame of the digitized speech signal in buffer position L, pre-process it, return to step 2), and continue harmonic detection on the signal in the updated buffer.
Compared with the prior art, the invention has the following advantages. The invention automatically searches all bands for the bands where the voiced harmonics are clearest, so it remains stable even at low signal-to-noise ratio or when the noise changes rapidly. It detects voiced sound using the properties that the harmonics and pitch of voiced sound vary continuously and that the harmonics are multiples of the pitch, so it does not need to determine a specific fundamental frequency and, unlike methods that locate voiced sound by detecting the fundamental, it is not disturbed by pitch multiples; even when the fundamental frequency is low, the invention can still detect the position of voiced sound. The invention exploits the universally slow variation of the human articulation mechanism, places no restriction on when voiced sound appears or how long it lasts, and searches a pitch range covering all possible human pitches, so it adapts to speakers of all ages, sexes and speaking habits. Because the invention detects voiced sound by searching for harmonics, and the harmonic structure of voiced sound is stable under different recording and transmission conditions, it imposes no special requirements on the recording or transmission of the signal; even when the fundamental-frequency band is missing, it can still locate voiced sound accurately, which makes it particularly suitable for voiced sound detection in telephone channels. The invention can be applied in speech endpoint detection, audio-visual signal segmentation, speech segmentation systems, and in the pre-processing of speech coding and speech enhancement systems.
Description of drawings
Fig. 1 is a flow chart of the voiced sound detection method based on harmonic characteristics provided by the invention;
Fig. 2 is a schematic diagram of the harmonic decision over 4 consecutive frames in the invention.
Embodiment
In practical speech signal processing systems, the speech signal is often corrupted by noise, and the noise is both unevenly distributed across bands and constantly changing. The signal-to-noise ratio of noisy speech therefore differs from band to band and varies as the speech and the noise vary. When the noise energy near the pitch band of the voiced sound is strong, the pitch is hard to detect even though the harmonic structure of the voiced signal is clear, so methods that detect voiced sound via the pitch are easily disturbed. Likewise, because some bands have low signal-to-noise ratio, searching for voiced sound by detecting harmonics over the full band lets the low-SNR bands reduce the overall reliability of the detection. The present algorithm therefore searches only for the 4-5 clearest harmonics, automatically avoiding the low-SNR bands; it works stably even under strong noise and is insensitive to changes in the noise.
Because the harmonics and the pitch concentrate most of the energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, clean voiced sound shows uniformly distributed energy extrema in the frequency domain, with spacing equal to the fundamental frequency. Even when the voiced signal is disturbed by the recording equipment and by noise, 4-5 equally spaced energy extrema usually remain in the frequency domain; this is the main basis on which the present invention detects voiced sound from its harmonic characteristics.
The invention is described further below with reference to the drawings and a specific embodiment.
Embodiment:
As shown in Fig. 1, the concrete steps of this embodiment are as follows.
Step 301: pre-process L frames of the digitized signal and store them in the buffer, in preparation for detecting voiced sound in them.
Step 302: take from the buffer a frame that has not yet undergone harmonic detection (say the i-th frame in the buffer), apply pre-processing such as windowing and pre-emphasis as appropriate for the system, zero-pad to N points (where N >= F, N = 2^x, x an integer, x >= 8), and perform an N-point discrete Fourier transform to obtain the discrete spectrum X(i, bin) = Σ_{n=0}^{N-1} x(i, n)·e^{-j(2π/N)·n·bin}, bin = 0, 1, ..., N-1, where x(i, n) is the n-th sample of the i-th frame in the buffer and bin is the spectral index. Taking the squared modulus gives the band energies ε_bin = |X(i, bin)|², where bin is the band number, 1 <= bin <= N/2, and the width of each band is R = sampling rate / N.
Step 303: denote the minimum band energy ε_min. For ε_bin with 1 < bin < N/2, if ε_bin simultaneously satisfies ε_bin > ε_{bin-1}, ε_bin > ε_{bin+1}, and ε_bin > M·ε_min, mark ε_bin as an energy extremum; it may correspond to a harmonic or the pitch, but may also correspond to incidental noise or a minor fluctuation of the speech spectrum. M is an empirical constant, 5 < M < 20. If u bands in total satisfy this requirement, record their positions (the subscripts bin of the marked ε_bin) in order of increasing frequency as I_k (k = 1..u). The condition ε_bin > M·ε_min is required because, when harmonics are present in the signal, the signal energy is unevenly distributed and ε_min differs greatly from the harmonic band energies; only an ε_bin satisfying this condition can correspond to a harmonic or the pitch.
Step 304: for each possible pitch, search for the extremum distribution that best matches it, in order to determine which voiced sound the current frame may correspond to. The human pitch range is 60-450 Hz; if pitch_α denotes the band number corresponding to the pitch, then [60/R] <= pitch_α <= [450/R], where [·] denotes the largest integer not exceeding the value and R is the band width, i.e. the frequency resolution (see step 302). Because the energy advantage of the pitch and harmonics is most evident at low frequencies, only the extrema in the range 60-2000 Hz are matched, i.e. [60/R] <= I_k <= [2000/R].
The search therefore proceeds as follows: pitch_α traverses all integers in [60/R] to [450/R], and for each value the energy extreme points numbered within [60/R] to [2000/R] are searched for points close to pitch_α and its harmonic positions. Since voiced sound usually shows at least 4-5 uniformly distributed energy extreme points in the frequency domain, the frame may be voiced with fundamental frequency pitch_α if among the existing extreme points there are 5 points spaced by pitch_α, with band numbers F_m ≈ m·pitch_α for m = 2, 3, ..., [2000/(R·pitch_α)], or 4 points spaced by pitch_α, with band numbers F_m ≈ m·pitch_α for m = 1, 2, 3, 4; otherwise the frame cannot correspond to that voiced sound.
To this end, for the currently tested pitch pitch_α, search I_k (k = 1..u) for the value I_{m′} closest to m·pitch_α, i.e. the I_{m′} satisfying |I_{m′} - m·pitch_α| <= |I_{m″} - m·pitch_α| (1 <= m′ <= u, 1 <= m″ <= u, m″ ≠ m′). The I_{m′} corresponding to m·pitch_α (m = 1, 2, ..., [2000/(R·pitch_α)]) are recorded as the set {P_1, P_2, P_3, ...}, and an element P_0 = 0 is added to form the set {P_0, P_1, P_2, P_3, ...}.
Step 305: if the set {P_0, P_1, P_2, P_3, ...} contains 5 consecutive elements P_t, P_{t+1}, P_{t+2}, P_{t+3}, P_{t+4} (t = 0, 1, ..., [2000/(R·pitch_α) - 4]) with equal spacings, the extremum distribution is considered to match the pitch. Because the band width is R, i.e. the frequency resolution is finite, the exact value of the pitch is expressed as the mean spacing of these 5 elements.
To judge whether the spacings of 5 elements are equal, for any 5 consecutive elements of the set {P_0, P_1, P_2, P_3, ...} compute the spacings D_1 = P_{t+1} - P_t, D_2 = P_{t+2} - P_{t+1}, D_3 = P_{t+3} - P_{t+2}, D_4 = P_{t+4} - P_{t+3} (t = 0, 1, ..., [2000/(R·pitch_α) - 4]). Take the mean D̄ of {D_1, D_2, D_3, D_4} and normalize the variance, expressed as Var = Σ_{q=1}^{4} (D_q - D̄)² / (4·D̄²). The smaller Var is, the more uniform the spacings and the more likely the frame contains voiced sound. A given pitch_α may have several 5-element combinations; select the one with the smallest normalized variance, and record its mean spacing as D_α and its normalized variance as Var_α.
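The Var formula of step 305 can be checked numerically. The sketch below evaluates it for concrete 5-element groups; the example values are illustrative, not data from the patent.

```python
def normalized_variance(P):
    """Var = sum_{q=1..4} (D_q - Dbar)^2 / (4 * Dbar^2) for P_t..P_{t+4}."""
    D = [P[q + 1] - P[q] for q in range(4)]   # the four spacings D_1..D_4
    Dbar = sum(D) / 4.0
    return sum((d - Dbar) ** 2 for d in D) / (4.0 * Dbar ** 2)
```

Perfectly uniform spacings give Var = 0, while a slightly uneven group such as [0, 25, 51, 77, 102] gives a small nonzero Var that still falls below the stated Var_thd range of step 306.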
Step 306: if Var_α < Var_thd (where Var_thd is an empirical value, 0.001 < Var_thd < 0.003), the current frame is considered possibly to contain voiced sound with pitch D_α, where D_α is the accurate fundamental frequency corresponding to this group of spacings; proceed to step 307. If, in the set {P_1, P_2, P_3, ...}, the spacings {D_1, D_2, D_3, D_4} of every 5-element combination fail to satisfy Var < Var_thd, proceed to step 310.
Step 307: judge whether D_α is very close to any other accurate fundamental frequency already saved for the current frame. The process is as follows.
During the traversal of pitch_α over all integers in [60/R] to [450/R], suppose some pitch_β also corresponded to a uniform extremum distribution, i.e. both Var_α and Var_β are below the Var_thd of step 306, so Var_β and D_β have been saved. Check whether |D_α - D_β| exceeds the threshold D_pitch, where D_pitch = D_min / R and D_min is an empirical value in Hz, 50 < D_min < 150.
If |D_α - D_β| >= D_pitch, proceed to step 308.
If |D_α - D_β| < D_pitch, proceed to step 309.
Step 308: this step is entered when the spacing normalized variance of the current accurate fundamental frequency D_α is small enough and D_α is sufficiently far from all other accurate fundamental frequencies. Record the current mean spacing D_α and normalized variance Var_α. Among all pitch_α in [60/R] to [450/R], if α = α_1, α_2, α_3, ... all correspond to uniformly distributed extrema, keep their normalized variances and accurate fundamental frequencies, denoted VAR = {Var1, Var2, ...} and D_All = {D1, D2, ...}. The reason is that when the input is voiced sound with a very low fundamental frequency, twice or even three times the pitch also falls within the human fundamental frequency range, i.e. within the search range of pitch_α, and its corresponding extremum spacings may be even more uniform; so the entry with the smallest normalized variance does not necessarily correspond to the true pitch. Common voiced sound detection algorithms locate voiced sound by tracking the fundamental frequency and are therefore easily defeated by pitch multiples; this method keeps several groups of data and screens them later using neighboring frames, which avoids the pitch-multiple interference problem.
Step 309: this step is entered when |D_α - D_β| < D_pitch in step 307. Of two accurate fundamental frequencies close in frequency, only the one with the smaller normalized variance is kept: if Var_α < Var_β, delete the records of D_β and Var_β from the sets VAR and D_All and record Var_α and D_α in their place; otherwise keep the records of D_β and Var_β and do not record the data of D_α. The reason is that if two very close pitches both appear possible, it may be because pitch_α is inexact due to the spectral resolution limit; choosing the more uniform extremum distribution keeps the pitch closer to the true one.
Step 310: judge whether all possible pitches (all integers in [60/R] <= pitch_α <= [450/R]) have been traversed. If so, proceed to step 311; if not, return to step 304.
Step 311: judge whether all frames in the buffer have been processed. After every frame has been tested against all pitches in [60/R] <= pitch_α <= [450/R] (i.e. for each possible pitch the best-matching extrema among the current extrema have been searched), what finally remains are VAR and D_All. For some frames, VAR and D_All may be empty sets, because no pitch has a satisfactory extremum distribution. If some signal in the buffer has not yet passed the extremum search and pitch tests of steps 302 to 310, return to step 302; if all frames in the buffer have been searched and tested, proceed to step 312.
Step 312: begin combining the information of several neighboring frames to make a preliminary decision on whether harmonics exist. First set the starting frame number t of the detection to 1.
Step 313: detect whether frames t to t+3 can be connected into harmonics, i.e. decide whether voiced sound exists. The criterion is: within the 4 consecutive frames, if the D_All of every pair of adjacent frames each contain at least one element very close to an element of the other frame's D_All, the 4 frames are considered to contain harmonics. Specifically, adjacent frames qualify as long as each frame's D_All contains at least one element whose difference from some element of the other frame's D_All, divided by the smaller of the two, does not exceed a constant ratio; the range of ratio is 10%-20%. Fig. 2 is a simplified diagram of the connection: among the 4 frames (frames t, t+1, t+2 and t+3), as long as every two consecutive frames have a connecting path, the 4 frames can be judged voiced.
Connection between adjacent frames is required because the fundamental frequency and harmonics of voiced sound vary continuously and gradually, with no jumps between two frames. Thus even if some segment shows uniformly distributed energy extrema because of interference, the frames will not connect continuously and the segment will not be judged voiced. In addition, the pitch multiples of some voiced sounds also have very small normalized variances, so the D_All of each frame may contain both the fundamental frequency and its multiples. Therefore, as long as either the fundamental or a multiple connects, i.e. there is one connecting path between the two frames, the fundamental and harmonics of the two frames are considered continuous; this prevents real voiced sound from being missed.
Step 314: increase the harmonic starting frame number t by 1 and continue detecting whether the next 4 frames connect.
Step 315: judge whether all groups of 4 consecutive frames in the buffer have been tested for harmonic connection, i.e. whether the 4 consecutive frames from frame L-3 to frame L have been tested. If so, proceed to step 316; otherwise return to step 313.
Step 316: because voiced sound varies continuously and gradually, when two voiced segments are very close, the segment between them can also be considered voiced. Therefore, using the information of neighboring frames, the voiced segments already decided in the buffer are further connected and shaped: any non-voiced segment in the buffer lying between two voiced segments and shorter than 20 milliseconds is also judged voiced.
Step 317: judge whether new digitized sound signal remains to be processed. If so, proceed to step 318; if there is no new input, proceed to step 319.
Step 318: output the voiced decision of the 1st frame, discard the contents of the first buffer frame, shift buffer frames 2 to L forward by one frame each together with their per-frame results, store a new frame of the digitized speech signal in buffer position L, and return to step 302.
Step 319: output the voiced detection result of every frame in the buffer; the harmonic detection of all signals ends.

Claims (6)

1. A voiced sound detection method based on harmonic characteristics, characterized in that it comprises the following steps:
1) preprocessing the input digitized signal, the preprocessing comprising framing, windowing, and pre-emphasis; using a buffer that holds 200~400 milliseconds of signal together with its intermediate computation results; zero-padding each frame of signal and performing an N-point discrete Fourier transform on it, where N = 2^x, x is an integer, and x ≥ 8;
2) for each frame of signal, calculating the energy of each frequency band of the frame, and searching for local extrema of the frequency-domain energy accordingly;
3) according to the harmonic characteristics of human voiced sound, matching the band-energy extrema found in step 2) against all pitches within the range of the human fundamental frequency, to find the matching pitches;
4) among all pitches matched to a frame of signal, merging those that lie close to one another, and forming the remaining matched pitches into a set;
5) according to speech characteristics, making a global voiced-sound judgment on a certain number of frames by combining the preceding and following information;
6) once the current buffer has been processed: if detection of all signals is finished, outputting the result; otherwise, shifting and updating the signal in the buffer, returning to step 1), and continuing harmonic detection on the signal in the current buffer.
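The preprocessing-and-transform chain of step 1) might be sketched as follows; the 0.97 pre-emphasis coefficient and the Hamming window are common choices assumed here (the claim fixes neither), and a naive O(N²) DFT is used for clarity:

```python
import cmath
import math

def band_energies(frame, N=256, preemph=0.97):
    """Preprocess one frame (pre-emphasis, Hamming window, zero-pad
    to N points) and return the N/2 band energies |X[k]|^2 of an
    N-point DFT, k = 0..N/2-1."""
    # pre-emphasis: y[n] = x[n] - a * x[n-1]
    y = [frame[0]] + [frame[n] - preemph * frame[n - 1]
                      for n in range(1, len(frame))]
    # Hamming window
    L = len(y)
    y = [y[n] * (0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)))
         for n in range(L)]
    # zero-pad to N points and take a naive DFT
    # (an FFT would be used in practice)
    y += [0.0] * (N - L)
    X = [sum(y[n] * cmath.exp(-2j * math.pi * k * n / N)
             for n in range(N))
         for k in range(N // 2)]
    return [abs(v) ** 2 for v in X]
```

A pure tone placed on DFT bin k concentrates its energy in band k, which is what the extremum search of step 2) exploits.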
2. The voiced sound detection method based on harmonic characteristics according to claim 1, characterized in that said step 2) comprises the following substeps:
21) selecting a frame and calculating the energies of its N/2 frequency bands, denoted in order of increasing frequency as ε_1, ε_2, ..., ε_{N/2}; the minimum band energy is denoted ε_min;
22) for band index bin in the range from 2 to N/2 − 1, if the band energy ε_bin simultaneously satisfies ε_bin > ε_{bin−1}, ε_bin > ε_{bin+1}, and ε_bin > M·ε_min, marking ε_bin as an energy extremum, where M is an empirical constant with 5 < M < 20;
23) letting u be the number of energy extrema found, and recording, in order of increasing frequency, the subscript positions of the ε_bin satisfying the extremum conditions of step 22), denoted I_k, where k = 1, 2, ..., u.
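Substeps 21)–23) amount to a thresholded local-maximum search; a minimal sketch follows, where M = 10 is an arbitrary value inside the claimed range 5 < M < 20 and band numbers are 1-based as in the claim:

```python
def find_energy_extrema(energies, M=10):
    """Return the 1-based positions I_k of band energies that exceed
    both neighbours and M times the minimum band energy (substeps
    21-23). `energies` lists eps_1..eps_{N/2} in increasing
    frequency order."""
    eps_min = min(energies)
    positions = []
    # 0-based index i corresponds to 1-based band number i + 1, so
    # this loop covers band numbers 2 .. N/2 - 1
    for i in range(1, len(energies) - 1):
        if (energies[i] > energies[i - 1]
                and energies[i] > energies[i + 1]
                and energies[i] > M * eps_min):
            positions.append(i + 1)
    return positions
```

The M·ε_min floor is what rejects the small ripples of broadband noise while keeping genuine harmonic peaks.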
3. The voiced sound detection method based on harmonic characteristics according to claim 1, characterized in that said step 3) comprises the following substeps:
31) traversing all possible fundamental frequencies pitch_α, whose range is [60/R] ≤ pitch_α ≤ [450/R], where [·] denotes the greatest integer not exceeding the enclosed value, and R is the width of each frequency band, R = sampling rate / N;
32) for the currently tested fundamental frequency pitch_α: if, among the existing extreme points, 5 points spaced by pitch_α are found, or 4 points spaced by pitch_α are found of which one lies at pitch_α itself, preliminarily determining that the current frame of signal may be voiced with fundamental frequency pitch_α; if the normalized variance of the spacings between these points is less than a threshold Var_thd, considering the spacing uniform, and taking the mean spacing of the most uniformly spaced extreme points found as the accurate fundamental frequency;
33) after traversing all possible fundamental frequencies, recording each accurate fundamental frequency that may exist for the current frame of signal together with its corresponding normalized variance.
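Substep 32) can be sketched as below. The matching tolerance `tol` and the normalized-variance threshold `var_thd` are illustrative values the claim leaves open, and for simplicity this sketch measures the variance of spacings between consecutive matched points only:

```python
def match_pitch(extrema, pitch, tol=0.5, var_thd=0.02):
    """Test one candidate fundamental `pitch` (in bands) against the
    extremum positions: require 5 points spaced by the pitch, or 4
    points of which one lies at the pitch itself; if the normalized
    variance of the spacings is below var_thd, return the mean
    spacing as the accurate fundamental frequency, else None."""
    hits = []  # nearest extremum for each harmonic multiple found
    for m in range(1, 6):
        near = [p for p in extrema if abs(p - m * pitch) <= tol]
        if near:
            hits.append(min(near, key=lambda p: abs(p - m * pitch)))
    has_fundamental = any(abs(p - pitch) <= tol for p in hits)
    if not (len(hits) >= 5 or (len(hits) >= 4 and has_fundamental)):
        return None
    # spacings between consecutive matched points
    gaps = [b - a for a, b in zip(hits, hits[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    if var / (mean * mean) >= var_thd:  # normalized variance
        return None
    return mean
```

For a clean harmonic series such as extrema at bands 10, 20, 30, 40, 50, the candidate pitch 10 is accepted and the mean spacing 10.0 is returned; a candidate that matches fewer than 4 multiples is rejected.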
4. The voiced sound detection method based on harmonic characteristics according to claim 3, characterized in that step 4) is entered after said step 33) finishes, said step 4) comprising the following substeps:
41) when two accurate fundamental frequencies satisfying the condition of step 33) differ by less than D_pitch, recording only the one with the smaller normalized variance, together with its normalized variance; where D_pitch = D_min / R, D_min being an empirical value in Hz with range 50 < D_min < 150;
42) among all pitch_α in the range [60/R] ≤ pitch_α ≤ [450/R], if α = α_1, α_2, α_3, ... each correspond to uniformly distributed extrema and are mutually separated by no less than the threshold D_pitch of step 41), forming the corresponding accurate fundamental frequencies into a set.
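Substep 41)'s merging rule, sketched on (pitch, normalized variance) pairs; the greedy left-to-right pass over sorted candidates is an editorial simplification, not the claimed procedure itself:

```python
def merge_close_pitches(candidates, d_pitch):
    """Among accurate fundamental frequencies closer together than
    D_pitch (in bands), keep only the one with the smaller
    normalized variance (substep 41). `candidates` is a list of
    (pitch, norm_var) pairs; the survivors form the set of
    substep 42)."""
    kept = []
    for pitch, var in sorted(candidates):
        if kept and pitch - kept[-1][0] < d_pitch:
            # too close to the last kept pitch: keep whichever of
            # the two has the smaller normalized variance
            if var < kept[-1][1]:
                kept[-1] = (pitch, var)
        else:
            kept.append((pitch, var))
    return kept
```

Favouring the smaller normalized variance keeps the candidate whose harmonic spacing was most uniform, i.e. the more reliable pitch estimate.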
5. The voiced sound detection method based on harmonic characteristics according to claim 4, characterized in that step 5) is entered after said step 4) finishes, said step 5) comprising the following substeps:
51) within 4 consecutive frames of signal, if for every pair of adjacent frames the sets formed by their accurate fundamental frequencies each contain at least one element that is very close to an element of the other frame's set, considering voiced sound to be present in these 4 frames of signal; the criterion for two elements being close is that the ratio of their difference to the smaller of the two values does not exceed a proportionality constant ratio, where ratio ranges from 10% to 20%;
52) after the preliminary decision of step 51), judging as voiced any non-voiced signal segment in the buffer that lies between two voiced segments and is shorter than 20 milliseconds.
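The adjacency test of substep 51) reduces to a relative-difference comparison between pitch sets; ratio = 15% here is one value inside the claimed 10%~20% range:

```python
def frames_share_pitch(set_a, set_b, ratio=0.15):
    """Two adjacent frames 'share' a fundamental if some element of
    each set is close to an element of the other: the difference
    over the smaller of the two values is at most `ratio`."""
    return any(abs(a - b) / min(a, b) <= ratio
               for a in set_a for b in set_b)

def four_frames_voiced(pitch_sets, ratio=0.15):
    """Judge 4 consecutive frames voiced when every adjacent pair of
    frames shares a close pitch (substep 51)."""
    return all(frames_share_pitch(p, q, ratio)
               for p, q in zip(pitch_sets, pitch_sets[1:]))
```

The relative (rather than absolute) criterion means a 4-band wobble is acceptable around band 100 but not around band 20, matching the claim's proportional definition of closeness.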
6. The voiced sound detection method based on harmonic characteristics according to claim 5, characterized in that step 6) is entered after said step 5) finishes, said step 6) comprising the following substeps:
61) if all signals have been detected and there is no new input signal, outputting the final result;
62) if signal remains to be detected, shifting the signals of buffer frames 2 through L forward by one frame each, moving the corresponding per-frame computation results forward as well; loading a new frame of digital speech signal into buffer frame L, returning to step 1), and continuing harmonic detection on the signal in the current buffer.
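The buffer update of substep 62) is a one-frame shift; a minimal sketch on plain lists (the real buffer would hold sample frames and per-frame results in whatever form the implementation uses):

```python
def advance_buffer(frames, results, new_frame, new_result):
    """Shift the L-frame buffer forward by one frame (substep 62):
    the decision for frame 1 is returned for output, frames 2..L
    and their per-frame results move forward one slot, and the new
    frame enters slot L."""
    emitted = results[0]  # decision of the outgoing first frame
    frames[:] = frames[1:] + [new_frame]
    results[:] = results[1:] + [new_result]
    return emitted
```

Each call therefore emits exactly one finalized decision while keeping a constant 200~400 ms of context in the buffer for the cross-frame judgments of step 5).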
CN200510089956A 2005-08-08 2005-08-08 Voiced sound detection method based on harmonic characteristic Expired - Fee Related CN100580768C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200510089956A CN100580768C (en) 2005-08-08 2005-08-08 Voiced sound detection method based on harmonic characteristic


Publications (2)

Publication Number Publication Date
CN1912992A true CN1912992A (en) 2007-02-14
CN100580768C CN100580768C (en) 2010-01-13

Family

ID=37721903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200510089956A Expired - Fee Related CN100580768C (en) 2005-08-08 2005-08-08 Voiced sound detection method based on harmonic characteristic

Country Status (1)

Country Link
CN (1) CN100580768C (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102655000A (en) * 2011-03-04 2012-09-05 华为技术有限公司 Method and device for classifying unvoiced sound and voiced sound
CN103310800A (en) * 2012-03-06 2013-09-18 中国科学院声学研究所 Voiced speech detection method and voiced speech detection system for preventing noise interference
CN103886871A (en) * 2014-01-28 2014-06-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104700843A (en) * 2015-02-05 2015-06-10 海信集团有限公司 Method and device for identifying ages
CN104934032A (en) * 2014-03-17 2015-09-23 华为技术有限公司 Method and device for voice signal processing according to frequency domain energy
CN105067101A (en) * 2015-08-05 2015-11-18 北方工业大学 Fundamental tone frequency characteristic extraction method based on vibration signal for vibration source identification
CN106356076A (en) * 2016-09-09 2017-01-25 北京百度网讯科技有限公司 Method and device for detecting voice activity on basis of artificial intelligence
CN106680585A (en) * 2017-01-05 2017-05-17 攀枝花学院 Detection method of harmonics/inter-harmonics
CN107833581A (en) * 2017-10-20 2018-03-23 广州酷狗计算机科技有限公司 A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
CN110111811A (en) * 2019-04-18 2019-08-09 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110739006A (en) * 2019-10-16 2020-01-31 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN111061909A (en) * 2019-11-22 2020-04-24 腾讯音乐娱乐科技(深圳)有限公司 Method and device for classifying accompaniment
CN112885380A (en) * 2021-01-26 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and medium for detecting unvoiced and voiced sounds

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1151490C (en) * 2000-09-13 2004-05-26 中国科学院自动化研究所 High-accuracy high-resolution base frequency extracting method for speech recognization

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102655000B (en) * 2011-03-04 2014-02-19 华为技术有限公司 Method and device for classifying unvoiced sound and voiced sound
CN102655000A (en) * 2011-03-04 2012-09-05 华为技术有限公司 Method and device for classifying unvoiced sound and voiced sound
CN103310800A (en) * 2012-03-06 2013-09-18 中国科学院声学研究所 Voiced speech detection method and voiced speech detection system for preventing noise interference
CN103310800B (en) * 2012-03-06 2015-10-07 中国科学院声学研究所 A kind of turbid speech detection method of anti-noise jamming and system
CN103886871A (en) * 2014-01-28 2014-06-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104934032B (en) * 2014-03-17 2019-04-05 华为技术有限公司 The method and apparatus that voice signal is handled according to frequency domain energy
CN104934032A (en) * 2014-03-17 2015-09-23 华为技术有限公司 Method and device for voice signal processing according to frequency domain energy
CN104700843A (en) * 2015-02-05 2015-06-10 海信集团有限公司 Method and device for identifying ages
CN105067101A (en) * 2015-08-05 2015-11-18 北方工业大学 Fundamental tone frequency characteristic extraction method based on vibration signal for vibration source identification
CN106356076B (en) * 2016-09-09 2019-11-05 北京百度网讯科技有限公司 Voice activity detector method and apparatus based on artificial intelligence
CN106356076A (en) * 2016-09-09 2017-01-25 北京百度网讯科技有限公司 Method and device for detecting voice activity on basis of artificial intelligence
CN106680585B (en) * 2017-01-05 2019-01-29 攀枝花学院 Harmonic wave/m-Acetyl chlorophosphonazo detection method
CN106680585A (en) * 2017-01-05 2017-05-17 攀枝花学院 Detection method of harmonics/inter-harmonics
CN107833581A (en) * 2017-10-20 2018-03-23 广州酷狗计算机科技有限公司 A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
CN107833581B (en) * 2017-10-20 2021-04-13 广州酷狗计算机科技有限公司 Method, device and readable storage medium for extracting fundamental tone frequency of sound
CN110111811A (en) * 2019-04-18 2019-08-09 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110111811B (en) * 2019-04-18 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110739006A (en) * 2019-10-16 2020-01-31 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN111061909A (en) * 2019-11-22 2020-04-24 腾讯音乐娱乐科技(深圳)有限公司 Method and device for classifying accompaniment
CN111061909B (en) * 2019-11-22 2023-11-28 腾讯音乐娱乐科技(深圳)有限公司 Accompaniment classification method and accompaniment classification device
CN112885380A (en) * 2021-01-26 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and medium for detecting unvoiced and voiced sounds

Also Published As

Publication number Publication date
CN100580768C (en) 2010-01-13

Similar Documents

Publication Publication Date Title
CN1912992A (en) Voiced sound detection method based on harmonic characteristic
CN1912993A (en) Voice end detection method based on energy and harmonic
CN106409310B (en) A kind of audio signal classification method and apparatus
CN101872616B (en) Endpoint detection method and system using same
CN102819009B (en) Driver sound localization system and method for automobile
CN102356427B (en) Noise suppression device
CN102044246B (en) Method and device for detecting audio signal
CN108899052B (en) Parkinson speech enhancement method based on multi-band spectral subtraction
CN104599677B (en) Transient noise suppressing method based on speech reconstructing
CN103646649A (en) High-efficiency voice detecting method
CN1335980A (en) Wide band speech synthesis by means of a mapping matrix
CN101145345B (en) Audio frequency classification method
CN101996640B (en) Frequency band expansion method and device
CN106650576A (en) Mining equipment health state judgment method based on noise characteristic statistic
CN102097095A (en) Speech endpoint detecting method and device
CN102592589B (en) Speech scoring method and device implemented through dynamically normalizing digital characteristics
CN103117067A (en) Voice endpoint detection method under low signal-to-noise ratio
CN113724712B (en) Bird sound identification method based on multi-feature fusion and combination model
Ealey et al. Harmonic tunnelling: tracking non-stationary noises during speech.
CN100541609C (en) A kind of method and apparatus of realizing open-loop pitch search
Hansen et al. Iterative speech enhancement with spectral constraints
Wu et al. A pitch-based method for the estimation of short reverberation time
Ouzounov A robust feature for speech detection
Sorin et al. The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation
Vahatalo et al. Voice activity detection for GSM adaptive multi-rate codec

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100113

CF01 Termination of patent right due to non-payment of annual fee