CN1912992A - Voiced sound detection method based on harmonic characteristic - Google Patents
Legal status: Granted
Abstract
This invention relates to a method for detecting voiced-sound signals in a noisy environment, comprising: pre-processing each frame of signal and applying an N-point discrete Fourier transform; computing the energy of each frequency band of the frame and searching for the energy extrema; matching the band-energy extrema against all possible pitches in the human pitch range, according to the harmonic character of human voiced sound, to find the pitches that may be present; making an overall voiced-sound judgement over a certain number of frames based on phonetic characteristics; then shifting and updating the buffer and continuing voiced-sound detection on the updated buffer. Advantages: performance remains stable even at low signal-to-noise ratio or under changeable noise, and the positions of voiced sound can be detected accurately even when the pitch is low.
Description
Technical field
The present invention relates to the field of speech signal processing, and in particular to a method for detecting voiced sound in a noisy environment.
Background technology
In practical speech signal processing systems, the position of voiced sound must be detected in many situations. For example, in voice activity detection (Voice Activity Detection) systems, detecting voiced sound gives a rough judgement of whether speech is present; in speech signal segmentation systems, detecting the position of voiced sound helps divide the signal into segments.
Voiced sound is periodic with the pitch period in the time domain and has a harmonic structure in the frequency domain. Therefore, common voiced-sound detection methods mainly judge the position of voiced sound using the pitch feature of voiced sound and the total intensity of the harmonics across all frequency bands. But when the speech signal is corrupted by noise, the performance of these methods degrades. For example, Irwin proposed in 1979 a way of assessing how strongly periodic a signal is, called least-squares periodicity estimation; R. Tucker applied this method to speech endpoint detection in 1992, but the method is easily affected by low-frequency noise. Yutaka Kobayashi proposed in 1980 that voiced sound can be found by searching for extrema in the cepstrum of the signal, but this method likewise cannot be used under low signal-to-noise ratio (SNR) conditions, and copes poorly with low-frequency noise. In 2001, Tong Zhang, when detecting speech and music in a signal for audio-video segmentation, used a method called short-time fundamental frequency (short-time fundamental frequency, SFuF for short). SFuF estimates an autoregressive model by computing the autocorrelation in the time domain, forms a smoothed spectrum from it, searches the spectrum for its poles to obtain the fundamental frequency, and locates voiced sound with it. But this method is easily disturbed by noise, and can also fail because a multiple of the fundamental frequency is mistaken for the fundamental frequency itself. In 2003, Ahmad R. Abu-El-Quran, when classifying audio signals, proposed an algorithm for detecting speech in a signal. This method first applies low-pass filtering and center clipping to each frame and computes its autocorrelation, then searches for extrema in the autocorrelation function to judge whether the frame contains harmonics. Finally, by computing the proportion of harmonic frames in a signal segment, it judges whether the segment contains speech. But this method is likewise easily disturbed by pitch multiples, and is hard to apply when the noise is strong or changeable.
In a practical speech signal processing system, the input signal has different SNRs in different frequency bands. When the noise energy near the pitch frequency is strong, the pitch is hard to detect even though the harmonic character is clear, so methods that search for voiced sound by detecting the pitch are easily disturbed. Likewise, if the harmonics of voiced sound are detected over the full band, the bands with low SNR also degrade the voiced-sound detection. Therefore, the present algorithm searches only for the 4~5 clearest harmonics, thus automatically avoiding the bands with low SNR; it works stably even when the noise is strong, and is insensitive to changes in the noise.

Since the harmonics and the pitch concentrate the main energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, pure voiced sound has uniformly distributed energy extrema in the frequency domain, whose spacing equals the fundamental frequency. Even when the voiced signal is disturbed by the recording equipment and by noise, 4~5 equally spaced energy extrema are usually retained in the frequency domain; this is the main basis on which the present invention detects voiced sound by the harmonic character.
Summary of the invention
The objective of the invention is to detect the position of voiced sound in a noisy speech signal.

To achieve the above objective, the voiced-sound detection method based on harmonic character provided by the invention comprises the following steps:
1) apply pre-processing such as framing, windowing and pre-emphasis to the input digitized sound signal, and set up a buffer to hold L frames of signal (total duration 200~300 milliseconds) and their intermediate results; store the first L frames of the input signal in the buffer;
2) let the frame length be F; for each frame in the buffer, first zero-pad it to N points (where N ≥ F, N = 2^x, x an integer and x ≥ 8), then apply an N-point discrete Fourier transform; compute the energy of each frequency band from the Fourier transform values, and use these to search for the frequency-domain energy extrema of the frame;
3) for each frame in the buffer, according to the frequency-domain energy characteristics of human voiced sound, match the frequency-domain energy extreme points found in step 2) against all possible pitches in the human fundamental-frequency range, to find the pitches that may exist in the frame;
4) for each frame in the buffer, among all the possible pitches found in step 3), merge the pitches whose frequencies differ little, and form the remaining pitches into a set;
5) according to the continuous and gradual character of the pitch and the harmonics of voiced sound, combine the information of the preceding and following frames in the buffer to make an overall voiced-sound existence judgement over a certain number of frames;
6) if there is no new digitized sound signal input, output the voiced-sound detection results of all frames in the buffer and end the detection; otherwise, output the voiced-sound verdict of the 1st frame, discard the content of the 1st frame of the buffer, move the 2nd to L-th frame signals and their corresponding computed results forward one frame each, read a new frame of signal into the L-th frame of the buffer, pre-process this new frame, return to step 2), and continue harmonic detection on the signals in the updated buffer.
In the above technical scheme, step 2) comprises the following substeps:
21) let the frame length be F; first choose a frame in the buffer, zero-pad it to N points (where N ≥ F, N = 2^x, x an integer and x ≥ 8), then apply an N-point discrete Fourier transform; compute the energies of the N/2 frequency bands of the frame, denoted ε_1, ε_2, ..., ε_{N/2} in order of increasing frequency, and denote the minimum band energy ε_min;
22) for a band number bin (where 1 < bin < N/2), if the band energy ε_bin simultaneously satisfies ε_bin > ε_{bin−1}, ε_bin > ε_{bin+1} and ε_bin > M·ε_min, mark ε_bin as an energy extremum, where M is an empirical constant, 5 < M < 20;
23) let u be the number of energy extrema found; in order of increasing frequency, record the subscript positions (i.e. the values of bin) of the ε_bin satisfying the extremum conditions of step 22), denoted I_k (k = 1..u).
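The band-energy extremum search of substeps 21)~23) can be sketched as follows. This is an illustrative Python fragment, not part of the patent: the function name, the use of NumPy's FFT, and the default values of N and M are all assumptions for the sketch.

```python
import numpy as np

def band_energy_extrema(frame, N=2048, M=10.0):
    """Zero-pad one frame to N points, take the DFT, and mark band-energy
    extrema as in substeps 21)-23). M is the empirical constant (5 < M < 20)."""
    x = np.zeros(N)
    x[:len(frame)] = frame                  # zero-pad a frame of length F <= N
    X = np.fft.fft(x)
    eps = np.abs(X[:N // 2]) ** 2           # energies of the N/2 bands
    eps_min = eps.min()
    extrema = [b for b in range(1, N // 2 - 1)
               if eps[b] > eps[b - 1] and eps[b] > eps[b + 1]
               and eps[b] > M * eps_min]    # local peak well above the floor
    return eps, extrema
```

For a frame containing a sinusoid centred on band 100, the returned list of extremum positions includes band 100, which is the behaviour the substeps describe.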
In the above technical scheme, step 3) comprises the following substeps:
Step 31) according to the human pitch range (60~450 Hz), traverse all possible fundamental frequencies pitch_α, where the range of pitch_α is [60/R] ≤ pitch_α ≤ [450/R], [·] denotes the largest integer not exceeding the value, and R = sampling rate/N is the width of each frequency band.
Step 32) for the currently tested fundamental frequency pitch_α: if among the existing extreme points 5 points spaced by pitch_α can be found, or 4 points spaced by pitch_α can be found that include one point located at pitch_α itself, tentatively judge that the current frame may be a voiced sound whose fundamental frequency is pitch_α. Accordingly, if the normalized variance of the spacings between the most uniform 4 or 5 extreme points found according to pitch_α satisfies a certain threshold Var_thd, pitch_α is considered possibly present, and the mean spacing of the most uniform extreme points is taken as the accurate fundamental frequency;
Step 33) after traversing all possible fundamental frequencies, record each accurate fundamental frequency that may exist for the current frame, together with the normalized variance corresponding to that accurate fundamental frequency.
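A minimal sketch of the matching in step 32), under stated assumptions: it simply picks the extremum nearest each of the first 5 harmonic positions (a simplification of the patent's 4-or-5-point search), and the normalization of the variance, dividing by the square of the mean spacing, is an assumption chosen to make the result dimensionless; the exact formula is not legible in this text. The function name and var_thd default are illustrative.

```python
import numpy as np

def match_pitch(extrema, pitch_bin, var_thd=0.002):
    """For a candidate pitch (in band numbers), pick the extreme point nearest
    each harmonic position m*pitch_bin (m = 1..5) and accept the candidate when
    the spacings of the chosen points are uniform enough.
    Returns (accurate_pitch, normalized_variance) or None."""
    ext = np.asarray(extrema)
    picked = [ext[np.abs(ext - m * pitch_bin).argmin()] for m in range(1, 6)]
    pts = [0] + picked                       # P0 = 0 prepended, as in step 304
    d = np.diff(pts)                         # spacings between chosen points
    mean = d.mean()
    nvar = d.var() / mean ** 2               # normalized variance (assumption)
    if nvar < var_thd:
        return mean, nvar                    # mean spacing = accurate pitch
    return None
```

With extrema at exact multiples of the candidate pitch plus one stray point, the candidate is accepted and the accurate pitch equals the spacing.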
In the above technical scheme, step 4) comprises the following substeps:
41) when two accurate fundamental frequencies satisfying the condition of step 32) differ by less than D_pitch, keep only the one with the smaller normalized variance, together with its normalized variance; where D_pitch = D_min/R and D_min is an empirical value in Hz, 50 < D_min < 150;
42) among all pitch_α ([60/R] ≤ pitch_α ≤ [450/R]), if α = α_1, α_2, α_3, ... can all be matched to uniformly distributed extrema, and the accurate fundamental frequencies are mutually separated as required by the condition of step 41), the corresponding accurate fundamental frequencies are recorded as a set.
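The merge of substeps 41)~42) can be sketched as a greedy pass over the candidates sorted by normalized variance, so that of any two accurate fundamental frequencies closer than D_pitch, the one with the smaller variance wins. The function name and the greedy order are assumptions of this sketch, not details from the patent.

```python
def merge_close_pitches(candidates, d_pitch):
    """candidates: list of (accurate_pitch, normalized_variance) pairs.
    When two accurate pitches differ by less than d_pitch bands, keep only
    the one with the smaller normalized variance (substep 41))."""
    kept = []
    for p, v in sorted(candidates, key=lambda c: c[1]):  # best variance first
        if all(abs(p - q) >= d_pitch for q, _ in kept):
            kept.append((p, v))              # far enough from everything kept
    return kept
```

For example, candidates at 20 and 21 bands with D_pitch = 5 collapse to the one with the smaller variance, while a candidate at 40 bands survives alongside it.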
In the above technical scheme, step 5) comprises the following substeps:
51) among 4 consecutive frames in the buffer, if for every pair of adjacent frames the sets formed by their accurate fundamental frequencies each contain at least one element that is very close to an element of the other frame's set, the 4 frames may be judged to contain voiced sound. The criterion for two elements being close is that the ratio of their difference to the smaller of the two does not exceed a constant ratio (where 10% < ratio < 20%);
52) after the preliminary decision of step 51), any non-voiced signal segment in the buffer that lies between two voiced segments and whose duration is less than 20 milliseconds is also judged to be voiced.
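The adjacent-frame connection test of substep 51) can be sketched as follows (illustrative Python only; the function names and the ratio default are assumptions, and each frame is represented by its set of accurate fundamental frequencies):

```python
def frames_connected(set_a, set_b, ratio=0.15):
    """Two adjacent frames connect when some pitch in one set is within
    `ratio` (relative to the smaller value) of some pitch in the other."""
    return any(abs(a - b) / min(a, b) <= ratio
               for a in set_a for b in set_b) if set_a and set_b else False

def voiced_4frames(pitch_sets, ratio=0.15):
    """Substep 51): 4 consecutive frames are judged voiced when every
    adjacent pair of the 4 frames is connected."""
    return all(frames_connected(pitch_sets[i], pitch_sets[i + 1], ratio)
               for i in range(3))
```

Four frames with slowly drifting pitches connect, while a frame whose only candidate jumps to double the pitch of its neighbours breaks the chain, matching the gradual-variation rationale of the step.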
In the above technical scheme, step 6) comprises the following substeps:
61) if there is no new digitized sound signal input, output the voiced-sound detection results of all frames in the buffer and end the detection;
62) if new digitized signal input remains, output the voiced-sound verdict of the 1st frame, discard the content of the 1st frame of the buffer, move the 2nd to L-th frame signals forward 1 frame each, and move the corresponding per-frame computed results forward accordingly; store a new frame of digitized speech signal in the L-th frame of the buffer, pre-process this new frame, return to step 2), and continue harmonic detection on the signals in the updated buffer.
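The sliding-buffer control flow of substeps 61)~62) amounts to the following loop. This is a sketch only: `analyze` stands in for the per-frame work of steps 2)~4) and `decide` for the overall judgement of step 5); both names are placeholders, not terms from the patent.

```python
from collections import deque

def run_detector(frames, L, analyze, decide):
    """Hold up to L analyzed frames; each time the buffer is full, emit the
    verdict for the oldest frame and shift; flush the rest at end of input."""
    buf = deque()
    out = []
    for f in frames:
        buf.append(analyze(f))              # pre-process and analyze new frame
        if len(buf) == L:
            out.append(decide(list(buf)))   # verdict for the 1st frame
            buf.popleft()                   # shift buffer forward one frame
    while buf:                              # no new input: output remaining
        out.append(decide(list(buf)))
        buf.popleft()
    return out
```

With identity placeholders, six input frames and L = 3 yield one verdict per frame, in input order, which is the behaviour substep 62) requires.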
Compared with the prior art, the advantages of the invention are: the invention automatically searches among all frequency bands for those with the clearest voiced-sound harmonics, so it remains stable even in environments with low SNR or rapidly changing noise; the invention detects voiced sound using the properties that the harmonics and the pitch of voiced sound are mutually continuous and that the harmonics are multiples of the pitch, so it does not need to determine the specific fundamental frequency, is not disturbed by pitch multiples as methods that search for voiced sound by detecting the fundamental frequency are, and can detect the position of voiced sound even when the fundamental frequency is low; the invention exploits the universally slow variation of the human vocal mechanism, places no restriction on the appearance or duration of voiced sound, and searches a pitch range that covers all possible pitches of human speech, so it adapts to speakers of all ages, sexes and speaking habits; the invention detects voiced sound by searching for harmonics, and the harmonic character of voiced sound is relatively stable under different recording and transmission conditions, so the invention places no special requirements on the recording and transmission conditions of the signal, and can locate voiced sound accurately even when the fundamental-frequency band is missing, which makes it especially suitable for voiced-sound detection in telephone channels. The invention can be applied to speech endpoint detection, audio-video signal segmentation and speech segmentation systems, and to the pre-processing of speech coding and speech enhancement systems.
Description of drawings
Fig. 1 is the flow chart of the voiced-sound detection method based on harmonic character provided by the invention;
Fig. 2 is a schematic diagram of the harmonic connection decision over 4 consecutive frames in the invention.
Embodiment
In a practical speech signal processing system, the speech signal is often corrupted by noise, and the noise is not only unevenly distributed in energy across the frequency bands but also changes constantly. Therefore, a noisy speech signal has different SNRs in different bands, and these change as the speech and the noise change. When the noise energy near the pitch band of the voiced sound is strong, the pitch is hard to detect even though the harmonic character of the voiced signal is clear, so methods that detect voiced sound by detecting the pitch are easily disturbed. Likewise, because some bands have low SNR, if voiced sound is searched for by detecting harmonics over the full band, the low-SNR bands reduce the overall reliability of the detection. Therefore, the present algorithm searches only for the 4~5 clearest harmonics, thus automatically avoiding the low-SNR bands; it works stably even when the noise is strong, and is insensitive to changes in the noise.

Since the harmonics and the pitch concentrate the main energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, pure voiced sound has uniformly distributed energy extrema in the frequency domain, whose spacing equals the fundamental frequency. Even when the voiced signal is disturbed by the recording equipment and by noise, 4~5 equally spaced energy extrema are usually retained in the frequency domain; this is the main basis on which the present invention detects voiced sound by the harmonic character.
The present invention is described further below in conjunction with the drawings and a specific embodiment.

Embodiment:

As shown in Fig. 1, the concrete steps of the present embodiment are as follows:

Step 301: pre-process L frames of digitized signal, store them in the buffer, and prepare to detect voiced sound in it.
Step 302: take from the buffer a frame that has not yet undergone harmonic detection (let it be the i-th frame in the buffer), apply pre-processing such as windowing and pre-emphasis according to the system's concrete situation, zero-pad to N points (where N ≥ F, N = 2^x, x an integer and x ≥ 8), then apply an N-point discrete Fourier transform to obtain the discrete spectrum X(i, bin), bin = 0, 1, ..., N−1, where x(i, n) denotes the n-th sample of the i-th frame in the buffer and bin is the spectrum index. Taking the squared modulus, each band energy is ε_bin = |X(i, bin)|², where bin is the band number, 1 ≤ bin ≤ N/2, and the width of each band is R = sampling rate/N.

Step 303: denote the minimum band energy ε_min; among the ε_bin (where 1 < bin < N/2), if ε_bin simultaneously satisfies ε_bin > ε_{bin−1}, ε_bin > ε_{bin+1} and ε_bin > M·ε_min, mark ε_bin as an energy extremum; it may correspond to a harmonic or to the pitch, but may also correspond to chance noise or a minor fluctuation of the speech spectrum. Here M is an empirical constant, 5 < M < 20. If u bands in total satisfy this requirement, record their positions (i.e. the subscripts bin of the ε_bin) in order of increasing frequency as I_k (k = 1..u). The reason for requiring ε_bin > M·ε_min is that when the signal contains harmonics, the signal energy is unevenly distributed and ε_min differs greatly from the harmonic band energies; only an ε_bin satisfying this condition may correspond to a harmonic or to the pitch.

Step 304: for each possible pitch, search for the extremum distribution that best approaches it; the purpose is to determine which voiced sound the current frame may correspond to. The human pitch range is 60~450 Hz; if pitch_α denotes the band number corresponding to the pitch, then [60/R] ≤ pitch_α ≤ [450/R], where [·] denotes the largest integer not exceeding the value and R is the width of each band, i.e. the frequency resolution (see step 302). Because the energy advantage of the pitch and the harmonics is more evident at low frequency, only the extrema in the 60~2000 Hz range are matched, corresponding to [60/R] ≤ I_k ≤ [2000/R].
The search process is therefore: pitch_α traverses all integers in [60/R] to [450/R], and among the energy extreme points numbered within [60/R] to [2000/R] it searches for the points close to pitch_α and to the harmonic positions. A voiced sound usually shows at least 4~5 uniformly distributed energy extreme points in the frequency domain. So, if among the existing extreme points there are 5 points spaced by pitch_α, whose band numbers F_m satisfy F_m ≈ m·pitch_α with m = 1, 2, 3, 4, 5, or there are 4 points spaced by pitch_α whose band numbers F_m satisfy F_m ≈ m·pitch_α with m = 1, 2, 3, 4, then the frame may be a voiced sound with fundamental frequency pitch_α; otherwise the frame cannot correspond to this voiced sound.

To this end, for the currently tested pitch pitch_α, search among the I_k (k = 1..u) for the value I_{m′} closest to m·pitch_α, i.e. the I_{m′} satisfying |I_{m′} − m·pitch_α| ≤ |I_{m″} − m·pitch_α| (1 ≤ m′ ≤ u, 1 ≤ m″ ≤ u, m″ ≠ m′). The I_{m′} corresponding to the m·pitch_α are denoted, in order, as the set {P_1, P_2, P_3, ...}, and one element P_0 = 0 is added to form the set {P_0, P_1, P_2, P_3, ...}.
Step 305: if the set {P_0, P_1, P_2, P_3, ...} contains 5 consecutive elements P_t, P_{t+1}, P_{t+2}, P_{t+3}, P_{t+4} with equal spacings, the extremum distribution is considered to match the pitch. Because the band width is R, i.e. the frequency resolution cannot be absolutely exact, the exact value of the pitch is expressed as the mean spacing of these 5 elements.

To assess whether the 5 element spacings are equal, for any 5 consecutive elements of the set {P_0, P_1, P_2, P_3, ...} compute the spacings D_1 = P_{t+1} − P_t, D_2 = P_{t+2} − P_{t+1}, D_3 = P_{t+3} − P_{t+2}, D_4 = P_{t+4} − P_{t+3}. Take the mean D of {D_1, D_2, D_3, D_4} and normalize their variance by it, denoting the result Var. The smaller Var is, the more uniform the spacing, and the more likely the frame contains voiced sound. The same pitch_α may have several 5-element combinations; the group with the smallest normalized variance is selected, its mean spacing recorded as D_α and its normalized variance as Var_α.
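The selection of the most uniform 5-element run in step 305 can be sketched as follows. Illustrative Python only: the exact normalization of the variance is not legible in this text, so dividing the variance of the spacings by the square of their mean is an assumption, chosen to make Var dimensionless and thus comparable with the dimensionless threshold Var_thd of step 306.

```python
import numpy as np

def best_combination(points):
    """Among all runs of 5 consecutive elements of {P0, P1, ...}, compute the
    spacings D1..D4, their mean D and the variance normalized by D**2, and
    keep the most uniform run. Returns (mean_spacing, norm_var) or None."""
    best = None
    for t in range(len(points) - 4):
        d = np.diff(points[t:t + 5])         # D1..D4 for this run
        mean = d.mean()
        nvar = d.var() / mean ** 2           # normalized variance (assumption)
        if best is None or nvar < best[1]:
            best = (mean, nvar)              # candidates D_alpha, Var_alpha
    return best
```

For perfectly equidistant points the normalized variance is exactly zero and the mean spacing is the candidate D_α.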
Step 306: if Var_α < Var_thd (where Var_thd is an empirical value, 0.001 < Var_thd < 0.003), then the current frame is considered possibly to contain a voiced sound whose pitch is D_α, where D_α denotes the accurate fundamental frequency corresponding to this group of spacings, and the process enters step 307. If, in the set {P_1, P_2, P_3, ...}, the spacings {D_1, D_2, D_3, D_4} of every 5-element combination fail to satisfy Var < Var_thd, the process enters step 310.
Step 307: judge whether D_α is very close to another accurate fundamental frequency already saved for the current frame. The process is as follows:

During the search in which the pitch pitch_α traverses all integers in [60/R] to [450/R], if some pitch pitch_β also once corresponded to a uniform extremum distribution, i.e. the Var_α and Var_β corresponding to pitch_α and pitch_β are both below the Var_thd of step 306, then Var_β and D_β have both been saved. Check whether |D_α − D_β| is greater than the threshold D_pitch, where D_pitch = D_min/R and D_min is an empirical value in Hz (50 < D_min < 150).

If |D_α − D_β| ≥ D_pitch, enter step 308.

If |D_α − D_β| < D_pitch, enter step 309.
Step 308: this step is entered when the spacing normalized variance corresponding to the current accurate fundamental frequency D_α is small enough and D_α is sufficiently far from all other accurate fundamental frequencies. Record the current spacing mean D_α and normalized variance Var_α. Among all pitch_α in [60/R] to [450/R], if α = α_1, α_2, α_3, ... can all be matched to uniformly distributed extrema, keep their normalized variances and accurate fundamental frequencies, denoted VAR = {Var1, Var2, ...} and D_All = {D1, D2, ...}. The reason is that when the input signal is a voiced sound with a very low fundamental frequency, the 2-fold or even 3-fold multiples of the pitch also lie within the human fundamental-frequency range, i.e. within the search range of pitch_α, and their corresponding extremum spacings may be even more uniform. So the entry with the smallest normalized variance does not necessarily correspond to the real pitch. Common voiced-sound detection algorithms often locate voiced sound by tracking the fundamental frequency and are therefore easily defeated by interference from pitch multiples; the present method keeps multiple groups of data and screens them at the end in combination with the preceding and following frames, which avoids the problem of pitch-multiple interference.
Step 309: this step is entered from step 307 when |D_α − D_β| < D_pitch. Of two accurate fundamental frequencies close in frequency, only the one with the smaller normalized variance is kept. That is: if Var_α < Var_β, delete the records of D_β and Var_β from the sets VAR and D_All and record Var_α and D_α in them; otherwise, keep the records of D_β and Var_β and do not record the information of D_α. The reason is that if two very close pitches could both exist, it is probably because pitch_α is inexact owing to the limited spectral resolution; selecting the more uniform group of extremum distributions keeps the pitch closer to the real one.
Step 310: judge whether all possible pitches (i.e. all integers in the range [60/R] ≤ pitch_α ≤ [450/R]) have been traversed; if yes, enter step 311; if no, return to step 304.

Step 311: judge whether all frames in the buffer have been processed. After each frame has passed the test of all pitches in [60/R] ≤ pitch_α ≤ [450/R] (i.e. for each possible pitch, the best-matching extrema among all current extrema have been searched for), what finally remain are VAR and D_All. For some frames, because no pitch yields a satisfactory extremum distribution, VAR and D_All may both be empty sets. If the buffer still contains signals that have not passed the extremum search of steps 302 to 310 and the test of possible pitches, return to step 302; if all frames in the buffer have been searched and tested, enter step 312.
Step 312: begin combining the information of several preceding and following frames to make a preliminary judgement on whether harmonics exist. First set the starting frame number t of the detection to 1.
Step 313: detect whether frames t to t+3 can be connected into harmonics, i.e. judge whether voiced sound exists. The condition is: among 4 consecutive frames, if the D_All of every pair of adjacent frames each contains at least one element very close to an element of the other frame's D_All, the 4 frames may be considered to contain harmonics. Specifically, two adjacent frames qualify as long as each frame's D_All contains at least one element such that the ratio of the difference between the two elements to the smaller of them does not exceed the constant ratio, whose range is 10%~20%. Fig. 2 is a simplified schematic of the connection. As can be seen, among the 4 frames (i.e. frames t, t+1, t+2 and t+3), as long as every two consecutive frames have a connecting path, the 4 frames can be declared voiced.

The connection between two frames is required because the fundamental frequency and the harmonics of voiced sound vary continuously and gradually, without sudden jumps between frames. In this way, even if some signal segment exhibits uniformly distributed energy extrema because of interference, those extrema cannot connect continuously across frames, and the segment will not be judged voiced. In addition, the pitch multiples of some voiced sounds also have very small normalized variances, so each frame's D_All may contain both the fundamental frequency and its multiples; therefore, as long as either the fundamental frequency or a multiple can connect, i.e. a connecting path exists between the two frames, the fundamental frequency and harmonics of the two frames are considered continuous, which prevents real voiced sound from being missed.
Step 314: increase the harmonic starting frame number t by 1, and continue to detect whether the next 4 frames are connected.

Step 315: judge whether the harmonic connection has been done for all groups of 4 consecutive frames in the buffer, that is, whether the 4 consecutive frames from frame L−3 to frame L have been tested; if so, enter step 316, otherwise return to step 313.

Step 316: because voiced sound is continuous and gradual, when two voiced segments are very close, the segment between them can also be considered voiced. Therefore, using the preceding and following information, further connect and reshape the segments already declared voiced in the buffer: any non-voiced signal segment in the buffer that lies between two voiced segments and whose duration is less than 20 milliseconds is also judged voiced.
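The connection-and-reshaping of step 316 can be sketched as a single pass over per-frame voiced flags. Illustrative Python only; the function name is an assumption, and max_gap is the 20-millisecond limit expressed in frames.

```python
def fill_short_gaps(flags, max_gap):
    """Step 316 sketch: a run of non-voiced frames shorter than max_gap frames
    (about 20 ms) lying between two voiced runs is relabeled voiced."""
    out = list(flags)
    i, n = 0, len(out)
    while i < n:
        if not out[i]:
            j = i
            while j < n and not out[j]:
                j += 1                       # j is one past the gap
            if 0 < i and j < n and (j - i) < max_gap:
                for k in range(i, j):
                    out[k] = True            # gap bounded by voiced on both sides
            i = j
        else:
            i += 1
    return out
```

Note that gaps at the start or end of the buffer are not filled, since they are not "between two voiced segments" as the step requires.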
Step 317: judge whether new digitized sound signal still needs to be processed. If there is signal to process, enter step 318; if there is no new signal input, enter step 319.

Step 318: output the voiced-sound verdict of the 1st frame, discard the content of the 1st frame of the buffer, move the 2nd to L-th frame signals forward 1 frame each, and move the corresponding per-frame computed results forward accordingly; store a new frame of digitized speech signal in the L-th frame of the buffer, and return to step 302.

Step 319: output the voiced-sound detection result of every frame in the buffer; the harmonic detection of all signals ends.
Claims (6)
1. A voiced-sound detection method based on harmonic character, characterized in that it comprises the steps:
1) pre-process the input digitized signal, the pre-processing comprising framing, windowing and pre-emphasis; use a buffer holding 200~400 milliseconds of signal and its intermediate calculation results, zero-pad every frame and apply an N-point discrete Fourier transform, where N = 2^x, x is an integer, x ≥ 8;
2) for each frame, compute the energy of each frequency band of the frame and, on that basis, search for the local extrema of the frequency-domain energy;
3) according to the harmonic characteristics of human voiced sound, match the band-energy extrema found in step 2) against all pitches in the human pitch range, and find the matching pitches;
4) among all pitches matched to a frame, merge the pitches close to each other, and form the remaining matched pitches into a set;
5) according to phonetic features, combine the preceding and following information to make an overall voiced-sound judgement over a certain number of frames;
6) when the current buffer has been computed: if the detection of all signals is finished, output the results; otherwise shift and update the signals in the buffer, return to step 1), and continue harmonic detection on the signals in the current buffer.
2. The voiced sound detection method based on harmonic characteristics according to claim 1, characterized in that said step 2) comprises the following substeps:
21) choosing a frame and computing the energies of its N/2 frequency bands, denoted ε_1, ε_2, ..., ε_{N/2} in order of increasing frequency; the minimum band energy is denoted ε_min;
22) for band index bin in the range from 2 to N/2−1, if the band energy ε_bin simultaneously satisfies ε_bin > ε_{bin−1}, ε_bin > ε_{bin+1}, and ε_bin > M·ε_min, marking ε_bin as an energy extremum, where M is an empirical constant, 5 < M < 20;
23) letting u be the number of energy extrema found and recording, in order of increasing frequency, the subscript positions of the ε_bin satisfying the extremum condition of step 22), denoted I_k, k = 1, 2, ..., u.
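Substeps 21)–23) amount to a local-maximum search over the N/2 band energies with a noise-floor threshold of M times the minimum. A minimal sketch of that search (illustrative, not the patented implementation; the input is one frame's N-point DFT, and the returned positions are the 1-based subscripts I_k):

```python
import numpy as np

def band_energy_extrema(spectrum, M=10):
    """Mark band energies that are local maxima and exceed M times the minimum."""
    energies = np.abs(spectrum[:len(spectrum) // 2]) ** 2   # ε_1 ... ε_{N/2}
    eps_min = energies.min()
    peaks = []
    for b in range(1, len(energies) - 1):                   # bin = 2 .. N/2-1, 0-based here
        if (energies[b] > energies[b - 1] and energies[b] > energies[b + 1]
                and energies[b] > M * eps_min):
            peaks.append(b + 1)                             # record 1-based subscript I_k
    return peaks
```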
3. The voiced sound detection method based on harmonic characteristics according to claim 1, characterized in that said step 3) comprises the following substeps:
31) traversing all possible fundamental frequencies pitch_α, whose range is [60/R] ≤ pitch_α ≤ [450/R], where [·] denotes taking the largest integer not exceeding the value, and R is the width of each frequency band, R = sampling rate / N;
32) for the fundamental frequency pitch_α currently under examination: if, among the existing extremum points, 5 points spaced pitch_α apart are found, or 4 points spaced pitch_α apart are found of which one lies at pitch_α itself, preliminarily deciding that the current frame of signal may contain voiced sound corresponding to the fundamental frequency pitch_α; if the normalized variance of the spacings between these points is less than a threshold Var_thd, the spacings are considered uniform, and the mean spacing of the most uniformly spaced extremum points found is taken as the refined fundamental frequency;
33) after traversing all possible fundamental frequencies, recording every refined fundamental frequency that may be present in the current frame of signal together with its corresponding normalized variance.
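Substeps 31)–33) can be read as scanning every candidate pitch lag (in units of bands) and testing whether the recorded extrema line up at its multiples. The sketch below is one such reading, not part of the claims; the matching tolerance rule and Var_thd = 0.05 are assumed values the claim leaves open:

```python
import numpy as np

def match_pitch_candidates(extrema, R, var_thd=0.05):
    """For each pitch lag in [60/R]..[450/R] bands, collect extrema near its
    multiples; accept 5 hits, or 4 hits including the lag itself; refine the
    pitch as the mean spacing when the normalized spacing variance < var_thd."""
    candidates = {}
    lo, hi = int(60 / R), int(450 / R)            # [60/R] <= pitch_a <= [450/R]
    for pitch in range(max(lo, 1), hi + 1):
        tol = max(1, pitch // 10)                 # assumed tolerance around multiples
        hits = []
        for h in range(1, 6):                     # harmonics 1..5
            near = [e for e in extrema if abs(e - h * pitch) <= tol]
            if near:
                hits.append(min(near, key=lambda e: abs(e - h * pitch)))
        hits = sorted(set(hits))
        if len(hits) >= 5 or (len(hits) >= 4 and any(abs(e - pitch) <= tol for e in hits)):
            spacings = np.diff(hits)
            norm_var = spacings.var() / spacings.mean() ** 2   # normalized variance
            if norm_var < var_thd:
                candidates[spacings.mean() * R] = norm_var     # refined pitch in Hz
    return candidates
```

For example, extrema at bands 10, 20, 30, 40, 50 with R = 31.25 Hz (8 kHz / 256) yield a single refined pitch of 312.5 Hz with zero spacing variance.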
4. The voiced sound detection method based on harmonic characteristics according to claim 3, characterized in that step 4) is entered after said step 33) is finished, and said step 4) comprises the following substeps:
41) when two refined fundamental frequencies satisfying the condition of step 33) differ by less than D_pitch, recording only the one with the smaller normalized variance together with its normalized variance, where D_pitch = D_min / R, D_min being an empirical value in Hz with range 50 < D_min < 150;
42) among all pitch_α in the range [60/R] ≤ pitch_α ≤ [450/R], if each of α = α_1, α_2, α_3, ... corresponds to uniformly distributed extrema, and they are spaced no less than the threshold D_pitch of step 41) apart, recording their corresponding refined fundamental frequencies as a set.
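Substep 41) keeps, of any pair of nearby candidates, the one whose harmonic spacing was more uniform. A sketch working directly in Hz, so the distance threshold is D_min itself rather than D_pitch = D_min/R; the (pitch, normalized variance) list format is an assumption, not the claimed data layout:

```python
def merge_close_pitches(candidates, d_min=100.0):
    """Merge refined pitch candidates closer than d_min Hz (50 < D_min < 150),
    keeping the one with the smaller normalized variance."""
    kept = []
    for pitch, var in sorted(candidates, key=lambda c: c[1]):  # lowest variance first
        if all(abs(pitch - k) >= d_min for k, _ in kept):      # far from everything kept
            kept.append((pitch, var))
    return sorted(kept)
```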
5. The voiced sound detection method based on harmonic characteristics according to claim 4, characterized in that step 5) is entered after said step 4) is finished, and said step 5) comprises the following substeps:
51) within 4 consecutive frames of signal, if for every pair of adjacent frames the sets formed by their refined fundamental frequencies each contain at least one element very close to an element of the other frame's set, the 4 frames of signal can be considered to contain voiced sound; the criterion for two elements being close is that the ratio of their difference to the smaller of the two does not exceed a proportionality constant ratio, where ratio ranges from 10% to 20%;
52) after the preliminary decision of step 51), judging as voiced also those non-voiced signal segments in the buffer that lie between two voiced segments and are shorter than 20 milliseconds.
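The adjacency test of substep 51) compares the pitch sets of consecutive frames. A minimal sketch (not part of the claims), with each frame's refined pitches given in Hz and ratio = 0.15 chosen from the claimed 10%~20% range:

```python
def frames_voiced(pitch_sets, ratio=0.15):
    """True when every pair of adjacent frames shares a pitch pair whose
    relative difference (w.r.t. the smaller pitch) is at most `ratio`."""
    def close(a, b):
        return abs(a - b) / min(a, b) <= ratio
    for s1, s2 in zip(pitch_sets, pitch_sets[1:]):
        if not any(close(a, b) for a in s1 for b in s2):
            return False
    return True
```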
6. The voiced sound detection method based on harmonic characteristics according to claim 5, characterized in that step 6) is entered after said step 5) is finished, and said step 6) comprises the following substeps:
61) if all signals have been detected and no new input signal remains, outputting the final result;
62) if signals remain to be detected, shifting frames 2 through L of the buffer forward by 1 frame each, the corresponding per-frame calculation results being shifted forward likewise; reading one new frame of the digital speech signal into frame L of the buffer, returning to step 1), and continuing the harmonic detection on the signal in the current buffer.
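Substep 62) is a sliding-window update: frame 1 leaves the buffer, frames 2..L each move forward one position, and the new frame enters at position L. With a deque this reduces to two operations (a sketch only; the frame objects are placeholders):

```python
from collections import deque

def update_buffer(buffer: deque, new_frame):
    """Shift frames 2..L forward by one slot and load the new frame at slot L."""
    buffer.popleft()          # frame 1 leaves the buffer
    buffer.append(new_frame)  # new frame becomes frame L
    return buffer
```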
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200510089956A CN100580768C (en) | 2005-08-08 | 2005-08-08 | Voiced sound detection method based on harmonic characteristic |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1912992A true CN1912992A (en) | 2007-02-14 |
CN100580768C CN100580768C (en) | 2010-01-13 |
Family
ID=37721903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200510089956A Expired - Fee Related CN100580768C (en) | 2005-08-08 | 2005-08-08 | Voiced sound detection method based on harmonic characteristic |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100580768C (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102655000A (en) * | 2011-03-04 | 2012-09-05 | 华为技术有限公司 | Method and device for classifying unvoiced sound and voiced sound |
CN103310800A (en) * | 2012-03-06 | 2013-09-18 | 中国科学院声学研究所 | Voiced speech detection method and voiced speech detection system for preventing noise interference |
CN103886871A (en) * | 2014-01-28 | 2014-06-25 | 华为技术有限公司 | Detection method of speech endpoint and device thereof |
CN104700843A (en) * | 2015-02-05 | 2015-06-10 | 海信集团有限公司 | Method and device for identifying ages |
CN104934032A (en) * | 2014-03-17 | 2015-09-23 | 华为技术有限公司 | Method and device for voice signal processing according to frequency domain energy |
CN105067101A (en) * | 2015-08-05 | 2015-11-18 | 北方工业大学 | Fundamental tone frequency characteristic extraction method based on vibration signal for vibration source identification |
CN106356076A (en) * | 2016-09-09 | 2017-01-25 | 北京百度网讯科技有限公司 | Method and device for detecting voice activity on basis of artificial intelligence |
CN106680585A (en) * | 2017-01-05 | 2017-05-17 | 攀枝花学院 | Detection method of harmonics/inter-harmonics |
CN107833581A (en) * | 2017-10-20 | 2018-03-23 | 广州酷狗计算机科技有限公司 | A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound |
CN110111811A (en) * | 2019-04-18 | 2019-08-09 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio signal detection method, device and storage medium |
CN110739006A (en) * | 2019-10-16 | 2020-01-31 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method and device, storage medium and electronic equipment |
CN111061909A (en) * | 2019-11-22 | 2020-04-24 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for classifying accompaniment |
CN112885380A (en) * | 2021-01-26 | 2021-06-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device, equipment and medium for detecting unvoiced and voiced sounds |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1151490C (en) * | 2000-09-13 | 2004-05-26 | 中国科学院自动化研究所 | High-accuracy high-resolution base frequency extracting method for speech recognization |
Also Published As
Publication number | Publication date |
---|---|
CN100580768C (en) | 2010-01-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1912992A (en) | Voiced sound detection method based on harmonic characteristic | |
CN1912993A (en) | Voice end detection method based on energy and harmonic | |
CN106409310B (en) | A kind of audio signal classification method and apparatus | |
CN101872616B (en) | Endpoint detection method and system using same | |
CN102819009B (en) | Driver sound localization system and method for automobile | |
CN102356427B (en) | Noise suppression device | |
CN102044246B (en) | Method and device for detecting audio signal | |
CN108899052B (en) | Parkinson speech enhancement method based on multi-band spectral subtraction | |
CN104599677B (en) | Transient noise suppressing method based on speech reconstructing | |
CN103646649A (en) | High-efficiency voice detecting method | |
CN1335980A (en) | Wide band speech synthesis by means of a mapping matrix | |
CN101145345B (en) | Audio frequency classification method | |
CN101996640B (en) | Frequency band expansion method and device | |
CN106650576A (en) | Mining equipment health state judgment method based on noise characteristic statistic | |
CN102097095A (en) | Speech endpoint detecting method and device | |
CN102592589B (en) | Speech scoring method and device implemented through dynamically normalizing digital characteristics | |
CN103117067A (en) | Voice endpoint detection method under low signal-to-noise ratio | |
CN113724712B (en) | Bird sound identification method based on multi-feature fusion and combination model | |
Ealey et al. | Harmonic tunnelling: tracking non-stationary noises during speech. | |
CN100541609C (en) | A kind of method and apparatus of realizing open-loop pitch search | |
Hansen et al. | Iterative speech enhancement with spectral constraints | |
Wu et al. | A pitch-based method for the estimation of short reverberation time | |
Ouzounov | A robust feature for speech detection | |
Sorin et al. | The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation | |
Vahatalo et al. | Voice activity detection for GSM adaptive multi-rate codec |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20100113 |
|
CF01 | Termination of patent right due to non-payment of annual fee |