CN1912992A - Voiced sound detection method based on harmonic characteristic - Google Patents

Voiced sound detection method based on harmonic characteristic

Info

Publication number
CN1912992A
CN1912992A (application numbers CNA2005100899564A, CN200510089956A)
Authority
CN
China
Prior art keywords: frame, signal, voiced sound, pitch, frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005100899564A
Other languages
Chinese (zh)
Other versions
CN100580768C (en)
Inventor
国雁萌
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and Beijing Kexin Technology Co Ltd
Priority to CN200510089956A
Publication of CN1912992A
Application granted
Publication of CN100580768C
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract

This invention relates to a method for detecting voiced sound in a noisy environment. Each frame of the signal is pre-processed and transformed with an N-point discrete Fourier transform; the energy of each frequency band of the frame is computed and its extrema are located. Based on the harmonic character of human voiced sound, the band-energy extrema are matched against all possible pitches in the human pitch range to find the pitches that may be present. An overall voiced decision is then made over a number of frames based on speech characteristics, after which the buffer is shifted and updated and voiced detection continues on the updated buffer. Advantage: performance remains stable even at low signal-to-noise ratio or under changing noise, and the positions of voiced sound can be detected accurately even when the pitch is low.

Description

A voiced sound detection method based on harmonic characteristics
Technical field
The present invention relates to the field of speech signal processing, and in particular to a method of detecting voiced sound in noisy environments.
Background
In practical speech signal processing systems, the position of voiced sound often needs to be detected. For example, in Voice Activity Detection systems, detecting voiced sound gives a rough indication of whether speech is present; in speech segmentation systems, locating voiced segments helps divide the signal into sections.
Voiced sound is periodic in the time domain, with the pitch as its period, and has a harmonic structure in the frequency domain. Common voiced sound detection methods therefore mainly use the pitch feature of voiced sound and the total intensity of the harmonics over all bands to locate voiced segments. When the speech signal is corrupted by noise, however, the performance of these methods degrades. For example, in 1979 Irwin proposed a way of assessing how strongly periodic a signal is, called least-squares periodicity estimation, and in 1992 R. Tucker used it for speech endpoint detection; but the method is easily disturbed by low-frequency noise. In 1980 Yutaka Kobayashi proposed locating voiced sound by searching for extrema in the cepstrum of the signal; this method likewise cannot be used at low signal-to-noise ratio and copes poorly with low-frequency noise. In 2001, Tong Zhang, detecting speech and music for audio-visual signal segmentation, used a method called short-time fundamental frequency (SFuF). SFuF estimates an autoregressive model from the time-domain autocorrelation, forms a smoothed spectrum, searches the spectrum for its peaks to obtain the fundamental frequency, and locates voiced sound from it. But this method is easily disturbed by noise, and can fail by mistaking a multiple of the fundamental frequency for the fundamental itself. In 2003, Ahmad R. Abu-El-Quran, classifying audio signals, proposed an algorithm for detecting speech in a signal: each frame is first low-pass filtered and center-clipped and its autocorrelation computed; extrema are then searched in the autocorrelation function to decide whether the frame contains harmonics; finally, the proportion of harmonic frames in a segment decides whether the segment contains speech. This method, too, is easily disturbed by pitch multiples, and is hard to apply when the noise is strong or changeable.
In practical speech signal processing systems, the signal-to-noise ratio of the input differs from band to band. When the noise energy near the pitch band is strong, the pitch is hard to detect even though the harmonic structure is clear, so methods that locate voiced sound by detecting the pitch are easily disturbed. Likewise, if the harmonics of voiced sound are detected over the full band, the low-SNR bands also degrade the detection. The present algorithm therefore searches only for the 4-5 clearest harmonics, automatically avoiding the low-SNR bands; it works stably even under strong noise and is insensitive to changes in the noise.
Because the harmonics and the pitch concentrate most of the energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, clean voiced sound shows uniformly distributed energy extrema in the frequency domain, with spacing equal to the fundamental frequency. Even when the voiced signal is disturbed by the recording equipment and by noise, 4-5 equally spaced energy extrema usually remain in the frequency domain; this is the main basis on which the present invention detects voiced sound from its harmonic characteristics.
Summary of the invention
The object of the invention is to detect the position of voiced sound in noisy speech signals.
To achieve this object, the voiced sound detection method based on harmonic characteristics provided by the invention comprises the following steps:
1) pre-process the input digitized sound signal (framing, windowing, pre-emphasis), and set up a buffer to hold L frames of signal (200-300 milliseconds in total) together with their intermediate results; store the first L frames of the input in the buffer;
2) let the frame length be F; for each frame in the buffer, first zero-pad to N points (where N >= F, N = 2^x, x an integer, x >= 8), then perform an N-point discrete Fourier transform; compute the energy of each frequency band from the Fourier transform values, and use these energies to search for the frequency-domain energy extrema of the frame;
3) for each frame in the buffer, based on the frequency-domain energy characteristics of human voiced sound, match the energy extreme points found in step 2) against all possible pitches in the human fundamental frequency range to find the pitches that may be present in the frame;
4) for each frame in the buffer, among all the possible pitches found in step 3), merge those whose frequencies differ only slightly, and collect the remaining pitches into a set;
5) according to the continuous, gradual variation of the pitch and the harmonics of voiced sound, combine the information of neighboring frames in the buffer to make an overall voiced decision over a number of frames;
6) if there is no new digitized input, output the voiced detection results of all frames in the buffer and end the detection; otherwise, output the voiced decision of the 1st frame, discard the contents of the first buffer frame, shift buffer frames 2 to L and their corresponding results forward by one frame each, read a new frame into buffer position L, pre-process it, return to step 2), and continue harmonic detection on the signal in the updated buffer.
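Taken together, steps 1) and 6) describe a sliding L-frame buffer that decides each frame using look-ahead context. The sketch below illustrates only that flow; the frame size F, buffer length L, and the placeholder per-frame decision (frame energy standing in for steps 2)-5)) are illustrative assumptions, not values fixed by the invention.

```python
import numpy as np

# Assumed values for illustration only
F, L = 256, 10

def decide(buf):
    # placeholder decision: energy of the oldest frame in the buffer
    return float(np.sum(buf[0] ** 2))

def process_stream(frames):
    """Yield one result per frame, deciding each frame with an L-frame buffer."""
    buf = []
    results = []
    for frame in frames:
        buf.append(np.asarray(frame, dtype=float))
        if len(buf) == L:
            results.append(decide(buf))   # decide the oldest frame in context
            buf.pop(0)                    # step 6): shift the buffer forward
    while buf:                            # input ended: flush remaining frames
        results.append(decide(buf))
        buf.pop(0)
    return results
```

Note that the buffer is only ever shifted by one frame at a time, so every frame is decided exactly once, with up to L-1 frames of look-ahead.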
In the above scheme, step 2) comprises the following substeps:
21) let the frame length be F; take one frame from the buffer, zero-pad it to N points (where N >= F, N = 2^x, x an integer, x >= 8), and perform an N-point discrete Fourier transform; compute the energies of the N/2 frequency bands of the frame, denoted ε_1, ε_2, ..., ε_{N/2} in order of increasing frequency, and denote the minimum band energy ε_min;
22) for a band number bin with 1 < bin < N/2, if the band energy ε_bin simultaneously satisfies ε_bin > ε_{bin-1}, ε_bin > ε_{bin+1}, and ε_bin > M·ε_min, mark ε_bin as an energy extremum, where M is an empirical constant, 5 < M < 20;
23) let u be the number of energy extrema found; in order of increasing frequency, record the subscript positions (i.e. the values of bin) of the ε_bin satisfying the extremum condition of step 22), denoted I_k (k = 1..u).
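As a concrete illustration of substeps 21)-23), the sketch below computes the N/2 band energies with a zero-padded FFT and marks the interior local maxima that also exceed M·ε_min. N = 1024 and M = 10 are assumed values within the stated ranges, not prescribed by the text.

```python
import numpy as np

def energy_extrema(frame, N=1024, M=10.0):
    """Return band energies eps_1..eps_{N/2} and extremum positions I_k."""
    X = np.fft.fft(frame, n=N)             # n=N zero-pads the frame to N points
    eps = np.abs(X[: N // 2]) ** 2         # energies of the N/2 bands
    eps_min = eps.min()
    # substep 22): interior local maxima that are at least M times eps_min
    idx = [b for b in range(1, N // 2 - 1)
           if eps[b] > eps[b - 1] and eps[b] > eps[b + 1]
           and eps[b] > M * eps_min]
    return eps, np.array(idx)              # I_k in order of increasing frequency
```

For a voiced-like frame (a sum of harmonics), the returned positions cluster near integer multiples of the pitch band, i.e. near multiples of F0/R.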
In the above scheme, step 3) comprises the following substeps:
31) according to the human pitch range (60-450 Hz), traverse all possible fundamental frequencies pitch_α, where [60/R] <= pitch_α <= [450/R], [·] denotes the largest integer not exceeding the value, and R = sampling rate / N is the width of each frequency band;
32) for the currently tested fundamental frequency pitch_α: if among the existing extreme points there are 5 points spaced by pitch_α, or 4 points spaced by pitch_α one of which lies at pitch_α itself, tentatively decide that the current frame may be voiced with fundamental frequency pitch_α. Accordingly, if the normalized variance of the spacings of the most uniform 4 or 5 extreme points found at spacing pitch_α is below a threshold Var_thd, pitch_α is considered possibly present, and the mean spacing of those most uniform extreme points is taken as the accurate fundamental frequency;
33) after traversing all possible fundamental frequencies, record each accurate fundamental frequency that may be present in the current frame, together with its normalized variance.
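Substeps 31)-33) can be sketched as follows for a single candidate pitch. The sketch assumes extrema are given as band numbers, picks the extremum nearest each harmonic multiple, and applies the normalized-variance test of substep 32); Var_thd = 0.002 is an assumed value within the stated range 0.001 < Var_thd < 0.003.

```python
import numpy as np

def match_pitch(extrema, pitch_a, max_band, var_thd=0.002):
    """For candidate pitch band pitch_a, pick the extremum nearest each
    multiple m*pitch_a, then test the most uniform 5 consecutive picks.
    Returns (accurate_f0_in_bands, normalized_variance) or None."""
    ext = np.asarray(extrema, dtype=float)
    m_max = int(max_band // pitch_a)
    if m_max < 4 or len(ext) == 0:
        return None
    # nearest extremum to each harmonic position, plus the added P_0 = 0
    P = [0.0] + [ext[np.argmin(np.abs(ext - m * pitch_a))]
                 for m in range(1, m_max + 1)]
    best = None
    for t in range(len(P) - 4):            # every group of 5 consecutive elements
        D = np.diff(P[t:t + 5])            # the four spacings D_1..D_4
        Dbar = D.mean()
        if Dbar <= 0:
            continue
        var = np.sum((D - Dbar) ** 2) / (4 * Dbar ** 2)
        if best is None or var < best[1]:
            best = (Dbar, var)
    if best is not None and best[1] < var_thd:
        return best
    return None
```

With perfectly spaced extrema the variance is zero and the mean spacing equals the candidate pitch; with scattered extrema no group passes the threshold and the candidate is rejected.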
In the above scheme, step 4) comprises the following substeps:
41) when two accurate fundamental frequencies satisfying the condition of step 32) differ by less than D_pitch, keep only the one with the smaller normalized variance, together with that variance; here D_pitch = D_min / R, where D_min is an empirical value in Hz, 50 < D_min < 150;
42) among all pitch_α ([60/R] <= pitch_α <= [450/R]), if α = α_1, α_2, α_3, ... all correspond to uniformly distributed extrema, and the accurate fundamental frequencies are mutually separated as required by step 41), record the corresponding accurate fundamental frequencies as a set.
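Substep 41) amounts to a one-pass merge over the candidates sorted by frequency. The sketch below assumes candidates are given as (accurate fundamental in bands, normalized variance) pairs and that D_pitch has already been converted from Hz to bands.

```python
def merge_close_pitches(candidates, d_pitch):
    """Substep 41): among candidates closer together than d_pitch bands,
    keep only the one with the smaller normalized variance."""
    kept = []
    for f0, var in sorted(candidates):           # low to high frequency
        if kept and abs(f0 - kept[-1][0]) < d_pitch:
            if var < kept[-1][1]:
                kept[-1] = (f0, var)             # replace with the better one
        else:
            kept.append((f0, var))
    return kept
```

The surviving fundamentals form the per-frame set used in step 5).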
In the above scheme, step 5) comprises the following substeps:
51) among any 4 consecutive frames in the buffer, if for every pair of adjacent frames the sets of accurate fundamental frequencies each contain at least one element very close to an element of the other frame's set, the 4 frames are considered to contain voiced sound. Two elements are considered close when the ratio of their difference to the smaller of the two does not exceed a constant ratio, where 10% < ratio < 20%;
52) after the preliminary decision of step 51), any non-voiced segment in the buffer that lies between two voiced segments and is shorter than 20 milliseconds is also judged to be voiced.
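Substeps 51) and 52) can be sketched as follows. Per-frame pitch sets are lists of accurate fundamentals; ratio = 0.15 is an assumed value in the stated 10%-20% range, and the 20 ms gap is expressed here as a frame count max_gap, an assumption about the frame rate.

```python
def close(a, b, ratio=0.15):
    """Substep 51): two elements are close if |a-b| / min(a,b) <= ratio."""
    return abs(a - b) / min(a, b) <= ratio

def voiced_frames(pitch_sets, ratio=0.15, max_gap=2):
    """Mark frames voiced when 4 consecutive frames chain pairwise-close
    pitches (substep 51), then fill short unvoiced gaps between voiced
    segments (substep 52, max_gap frames standing in for 20 ms)."""
    L = len(pitch_sets)
    voiced = [False] * L
    for t in range(L - 3):
        if all(any(close(a, b, ratio)
                   for a in pitch_sets[i] for b in pitch_sets[i + 1])
               for i in range(t, t + 3)):
            for i in range(t, t + 4):
                voiced[i] = True
    i = 0
    while i < L:                       # fill short gaps between voiced segments
        if not voiced[i]:
            j = i
            while j < L and not voiced[j]:
                j += 1
            if 0 < i and j < L and (j - i) < max_gap:
                for k in range(i, j):
                    voiced[k] = True
            i = j
        else:
            i += 1
    return voiced
```

A frame with an empty pitch set breaks the chain, but if such a break is short and sits between two voiced runs, the gap-filling pass restores it.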
In the above scheme, step 6) comprises the following substeps:
61) if there is no new digitized sound signal input, output the voiced detection results of all frames in the buffer and end the detection;
62) if new digitized input remains, output the voiced decision of the 1st frame, discard the contents of the first buffer frame, shift buffer frames 2 to L forward by one frame each together with their per-frame results, store a new frame of the digitized speech signal in buffer position L, pre-process it, return to step 2), and continue harmonic detection on the signal in the updated buffer.
Compared with the prior art, the invention has the following advantages. The invention automatically searches all bands for the bands where the voiced harmonics are clearest, so it remains stable even at low signal-to-noise ratio or when the noise changes rapidly. It detects voiced sound using the properties that the harmonics and pitch of voiced sound vary continuously and that the harmonics are multiples of the pitch, so it does not need to determine a specific fundamental frequency and, unlike methods that locate voiced sound by detecting the fundamental, it is not disturbed by pitch multiples; even when the fundamental frequency is low, the invention can still detect the position of voiced sound. The invention exploits the universally slow variation of the human articulation mechanism, places no restriction on when voiced sound appears or how long it lasts, and searches a pitch range covering all possible human pitches, so it adapts to speakers of all ages, sexes and speaking habits. Because the invention detects voiced sound by searching for harmonics, and the harmonic structure of voiced sound is stable under different recording and transmission conditions, it imposes no special requirements on the recording or transmission of the signal; even when the fundamental-frequency band is missing, it can still locate voiced sound accurately, which makes it particularly suitable for voiced sound detection in telephone channels. The invention can be applied in speech endpoint detection, audio-visual signal segmentation, speech segmentation systems, and in the pre-processing of speech coding and speech enhancement systems.
Description of drawings
Fig. 1 is a flow chart of the voiced sound detection method based on harmonic characteristics provided by the invention;
Fig. 2 is a schematic diagram of the harmonic decision over 4 consecutive frames in the invention.
Embodiment
In practical speech signal processing systems, the speech signal is often corrupted by noise, and the noise is both unevenly distributed across bands and constantly changing. The signal-to-noise ratio of noisy speech therefore differs from band to band and varies as the speech and the noise vary. When the noise energy near the pitch band of the voiced sound is strong, the pitch is hard to detect even though the harmonic structure of the voiced signal is clear, so methods that detect voiced sound via the pitch are easily disturbed. Likewise, because some bands have low signal-to-noise ratio, searching for voiced sound by detecting harmonics over the full band lets the low-SNR bands reduce the overall reliability of the detection. The present algorithm therefore searches only for the 4-5 clearest harmonics, automatically avoiding the low-SNR bands; it works stably even under strong noise and is insensitive to changes in the noise.
Because the harmonics and the pitch concentrate most of the energy of voiced sound, and the harmonic frequencies are integer multiples of the fundamental frequency, clean voiced sound shows uniformly distributed energy extrema in the frequency domain, with spacing equal to the fundamental frequency. Even when the voiced signal is disturbed by the recording equipment and by noise, 4-5 equally spaced energy extrema usually remain in the frequency domain; this is the main basis on which the present invention detects voiced sound from its harmonic characteristics.
The invention is described further below with reference to the drawings and a specific embodiment.
Embodiment:
As shown in Fig. 1, the concrete steps of this embodiment are as follows.
Step 301: pre-process L frames of the digitized signal and store them in the buffer, in preparation for detecting voiced sound in them.
Step 302: take from the buffer a frame that has not yet undergone harmonic detection (say the i-th frame in the buffer), apply pre-processing such as windowing and pre-emphasis as appropriate for the system, zero-pad to N points (where N >= F, N = 2^x, x an integer, x >= 8), and perform an N-point discrete Fourier transform to obtain the discrete spectrum X(i, bin) = Σ_{n=0}^{N-1} x(i, n)·e^{-j(2π/N)·n·bin}, bin = 0, 1, ..., N-1, where x(i, n) is the n-th sample of the i-th frame in the buffer and bin is the spectral index. Taking the squared modulus gives the band energies ε_bin = |X(i, bin)|², where bin is the band number, 1 <= bin <= N/2, and the width of each band is R = sampling rate / N.
Step 303: denote the minimum band energy ε_min. For ε_bin with 1 < bin < N/2, if ε_bin simultaneously satisfies ε_bin > ε_{bin-1}, ε_bin > ε_{bin+1}, and ε_bin > M·ε_min, mark ε_bin as an energy extremum; it may correspond to a harmonic or the pitch, but may also correspond to incidental noise or a minor fluctuation of the speech spectrum. M is an empirical constant, 5 < M < 20. If u bands in total satisfy this requirement, record their positions (the subscripts bin of the marked ε_bin) in order of increasing frequency as I_k (k = 1..u). The condition ε_bin > M·ε_min is required because, when harmonics are present in the signal, the signal energy is unevenly distributed and ε_min differs greatly from the harmonic band energies; only an ε_bin satisfying this condition can correspond to a harmonic or the pitch.
Step 304: for each possible pitch, search for the extremum distribution that best matches it, in order to determine which voiced sound the current frame may correspond to. The human pitch range is 60-450 Hz; if pitch_α denotes the band number corresponding to the pitch, then [60/R] <= pitch_α <= [450/R], where [·] denotes the largest integer not exceeding the value and R is the band width, i.e. the frequency resolution (see step 302). Because the energy advantage of the pitch and harmonics is most evident at low frequencies, only the extrema in the range 60-2000 Hz are matched, i.e. [60/R] <= I_k <= [2000/R].
The search therefore proceeds as follows: pitch_α traverses all integers in [60/R] to [450/R], and for each value the energy extreme points numbered within [60/R] to [2000/R] are searched for points close to pitch_α and its harmonic positions. Since voiced sound usually shows at least 4-5 uniformly distributed energy extreme points in the frequency domain, the frame may be voiced with fundamental frequency pitch_α if among the existing extreme points there are 5 points spaced by pitch_α, with band numbers F_m ≈ m·pitch_α for m = 2, 3, ..., [2000/(R·pitch_α)], or 4 points spaced by pitch_α, with band numbers F_m ≈ m·pitch_α for m = 1, 2, 3, 4; otherwise the frame cannot correspond to that voiced sound.
To this end, for the currently tested pitch pitch_α, search I_k (k = 1..u) for the value I_{m′} closest to m·pitch_α, i.e. the I_{m′} satisfying |I_{m′} - m·pitch_α| <= |I_{m″} - m·pitch_α| (1 <= m′ <= u, 1 <= m″ <= u, m″ ≠ m′). The I_{m′} corresponding to m·pitch_α (m = 1, 2, ..., [2000/(R·pitch_α)]) are recorded as the set {P_1, P_2, P_3, ...}, and an element P_0 = 0 is added to form the set {P_0, P_1, P_2, P_3, ...}.
Step 305: if the set {P_0, P_1, P_2, P_3, ...} contains 5 consecutive elements P_t, P_{t+1}, P_{t+2}, P_{t+3}, P_{t+4} (t = 0, 1, ..., [2000/(R·pitch_α) - 4]) with equal spacings, the extremum distribution is considered to match the pitch. Because the band width is R, i.e. the frequency resolution is finite, the exact value of the pitch is expressed as the mean spacing of these 5 elements.
To judge whether the spacings of 5 elements are equal, for any 5 consecutive elements of the set {P_0, P_1, P_2, P_3, ...} compute the spacings D_1 = P_{t+1} - P_t, D_2 = P_{t+2} - P_{t+1}, D_3 = P_{t+3} - P_{t+2}, D_4 = P_{t+4} - P_{t+3} (t = 0, 1, ..., [2000/(R·pitch_α) - 4]). Take the mean D̄ of {D_1, D_2, D_3, D_4} and normalize the variance, expressed as Var = Σ_{q=1}^{4} (D_q - D̄)² / (4·D̄²). The smaller Var is, the more uniform the spacings and the more likely the frame contains voiced sound. A given pitch_α may have several 5-element combinations; select the one with the smallest normalized variance, and record its mean spacing as D_α and its normalized variance as Var_α.
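The Var formula of step 305 can be checked numerically. The sketch below evaluates it for concrete 5-element groups; the example values are illustrative, not data from the patent.

```python
def normalized_variance(P):
    """Var = sum_{q=1..4} (D_q - Dbar)^2 / (4 * Dbar^2) for P_t..P_{t+4}."""
    D = [P[q + 1] - P[q] for q in range(4)]   # the four spacings D_1..D_4
    Dbar = sum(D) / 4.0
    return sum((d - Dbar) ** 2 for d in D) / (4.0 * Dbar ** 2)
```

Perfectly uniform spacings give Var = 0, while a slightly uneven group such as [0, 25, 51, 77, 102] gives a small nonzero Var that still falls below the stated Var_thd range of step 306.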
Step 306: if Var_α < Var_thd (where Var_thd is an empirical value, 0.001 < Var_thd < 0.003), the current frame is considered possibly to contain voiced sound with pitch D_α, where D_α is the accurate fundamental frequency corresponding to this group of spacings; proceed to step 307. If, in the set {P_1, P_2, P_3, ...}, the spacings {D_1, D_2, D_3, D_4} of every 5-element combination fail to satisfy Var < Var_thd, proceed to step 310.
Step 307: judge whether D_α is very close to any other accurate fundamental frequency already saved for the current frame. The process is as follows.
During the traversal of pitch_α over all integers in [60/R] to [450/R], suppose some pitch_β also corresponded to a uniform extremum distribution, i.e. both Var_α and Var_β are below the Var_thd of step 306, so Var_β and D_β have been saved. Check whether |D_α - D_β| exceeds the threshold D_pitch, where D_pitch = D_min / R and D_min is an empirical value in Hz, 50 < D_min < 150.
If |D_α - D_β| >= D_pitch, proceed to step 308.
If |D_α - D_β| < D_pitch, proceed to step 309.
Step 308: this step is entered when the spacing normalized variance of the current accurate fundamental frequency D_α is small enough and D_α is sufficiently far from all other accurate fundamental frequencies. Record the current mean spacing D_α and normalized variance Var_α. Among all pitch_α in [60/R] to [450/R], if α = α_1, α_2, α_3, ... all correspond to uniformly distributed extrema, keep their normalized variances and accurate fundamental frequencies, denoted VAR = {Var1, Var2, ...} and D_All = {D1, D2, ...}. The reason is that when the input is voiced sound with a very low fundamental frequency, twice or even three times the pitch also falls within the human fundamental frequency range, i.e. within the search range of pitch_α, and its corresponding extremum spacings may be even more uniform; so the entry with the smallest normalized variance does not necessarily correspond to the true pitch. Common voiced sound detection algorithms locate voiced sound by tracking the fundamental frequency and are therefore easily defeated by pitch multiples; this method keeps several groups of data and screens them later using neighboring frames, which avoids the pitch-multiple interference problem.
Step 309: this step is entered when |D_α - D_β| < D_pitch in step 307. Of two accurate fundamental frequencies close in frequency, only the one with the smaller normalized variance is kept: if Var_α < Var_β, delete the records of D_β and Var_β from the sets VAR and D_All and record Var_α and D_α in their place; otherwise keep the records of D_β and Var_β and do not record the data of D_α. The reason is that if two very close pitches both appear possible, it may be because pitch_α is inexact due to the spectral resolution limit; choosing the more uniform extremum distribution keeps the pitch closer to the true one.
Step 310: judge whether all possible pitches (all integers in [60/R] <= pitch_α <= [450/R]) have been traversed. If so, proceed to step 311; if not, return to step 304.
Step 311: judge whether all frames in the buffer have been processed. After every frame has been tested against all pitches in [60/R] <= pitch_α <= [450/R] (i.e. for each possible pitch the best-matching extrema among the current extrema have been searched), what finally remains are VAR and D_All. For some frames, VAR and D_All may be empty sets, because no pitch has a satisfactory extremum distribution. If some signal in the buffer has not yet passed the extremum search and pitch tests of steps 302 to 310, return to step 302; if all frames in the buffer have been searched and tested, proceed to step 312.
Step 312: begin combining the information of several neighboring frames to make a preliminary decision on whether harmonics exist. First set the starting frame number t of the detection to 1.
Step 313: detect whether frames t to t+3 can be connected into harmonics, i.e. decide whether voiced sound exists. The criterion is: within the 4 consecutive frames, if the D_All of every pair of adjacent frames each contain at least one element very close to an element of the other frame's D_All, the 4 frames are considered to contain harmonics. Specifically, adjacent frames qualify as long as each frame's D_All contains at least one element whose difference from some element of the other frame's D_All, divided by the smaller of the two, does not exceed a constant ratio; the range of ratio is 10%-20%. Fig. 2 is a simplified diagram of the connection: among the 4 frames (frames t, t+1, t+2 and t+3), as long as every two consecutive frames have a connecting path, the 4 frames can be judged voiced.
Connection between adjacent frames is required because the fundamental frequency and harmonics of voiced sound vary continuously and gradually, with no jumps between two frames. Thus even if some segment shows uniformly distributed energy extrema because of interference, the frames will not connect continuously and the segment will not be judged voiced. In addition, the pitch multiples of some voiced sounds also have very small normalized variances, so the D_All of each frame may contain both the fundamental frequency and its multiples. Therefore, as long as either the fundamental or a multiple connects, i.e. there is one connecting path between the two frames, the fundamental and harmonics of the two frames are considered continuous; this prevents real voiced sound from being missed.
Step 314: increase the harmonic starting frame number t by 1 and continue detecting whether the next 4 frames connect.
Step 315: judge whether all groups of 4 consecutive frames in the buffer have been tested for harmonic connection, i.e. whether the 4 consecutive frames from frame L-3 to frame L have been tested. If so, proceed to step 316; otherwise return to step 313.
Step 316: because voiced sound varies continuously and gradually, when two voiced segments are very close, the segment between them can also be considered voiced. Therefore, using the information of neighboring frames, the voiced segments already decided in the buffer are further connected and shaped: any non-voiced segment in the buffer lying between two voiced segments and shorter than 20 milliseconds is also judged voiced.
Step 317: judge whether new digitized sound signal remains to be processed. If so, proceed to step 318; if there is no new input, proceed to step 319.
Step 318: output the voiced decision of the 1st frame, discard the contents of the first buffer frame, shift buffer frames 2 to L forward by one frame each together with their per-frame results, store a new frame of the digitized speech signal in buffer position L, and return to step 302.
Step 319: output the voiced detection result of every frame in the buffer; the harmonic detection of all signals ends.

Claims (6)

1. A voiced sound detection method based on harmonic characteristics, characterized in that it comprises the following steps:
1) preprocessing the input digitized signal, the preprocessing comprising framing, windowing, and pre-emphasis; using a buffer that holds 200~400 milliseconds of signal together with its intermediate computation results; zero-padding each frame of signal and performing an N-point discrete Fourier transform on it, where N = 2^x, x is an integer, and x ≥ 8;
2) for each frame of signal, calculating the energy of each frequency band of the frame, and searching for local extrema of the frequency-domain energy accordingly;
3) according to the harmonic characteristics of human voiced sound, matching the band-energy extrema found in step 2) against all pitches within the range of the human fundamental frequency, to find the matching pitches;
4) among all pitches matched to a frame of signal, merging those that lie close to one another, and forming the remaining matched pitches into a set;
5) according to speech characteristics, making a global voiced-sound judgment on a certain number of frames by combining the preceding and following information;
6) once the current buffer has been processed: if detection of all signals is finished, outputting the result; otherwise, shifting and updating the signal in the buffer, returning to step 1), and continuing harmonic detection on the signal in the current buffer.
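The preprocessing-and-transform chain of step 1) might be sketched as follows; the 0.97 pre-emphasis coefficient and the Hamming window are common choices assumed here (the claim fixes neither), and a naive O(N²) DFT is used for clarity:

```python
import cmath
import math

def band_energies(frame, N=256, preemph=0.97):
    """Preprocess one frame (pre-emphasis, Hamming window, zero-pad
    to N points) and return the N/2 band energies |X[k]|^2 of an
    N-point DFT, k = 0..N/2-1."""
    # pre-emphasis: y[n] = x[n] - a * x[n-1]
    y = [frame[0]] + [frame[n] - preemph * frame[n - 1]
                      for n in range(1, len(frame))]
    # Hamming window
    L = len(y)
    y = [y[n] * (0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)))
         for n in range(L)]
    # zero-pad to N points and take a naive DFT
    # (an FFT would be used in practice)
    y += [0.0] * (N - L)
    X = [sum(y[n] * cmath.exp(-2j * math.pi * k * n / N)
             for n in range(N))
         for k in range(N // 2)]
    return [abs(v) ** 2 for v in X]
```

A pure tone placed on DFT bin k concentrates its energy in band k, which is what the extremum search of step 2) exploits.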
2. The voiced sound detection method based on harmonic characteristics according to claim 1, characterized in that said step 2) comprises the following substeps:
21) selecting a frame and calculating the energies of its N/2 frequency bands, denoted in order of increasing frequency as ε_1, ε_2, ..., ε_{N/2}; the minimum band energy is denoted ε_min;
22) for band index bin in the range from 2 to N/2 − 1, if the band energy ε_bin simultaneously satisfies ε_bin > ε_{bin−1}, ε_bin > ε_{bin+1}, and ε_bin > M·ε_min, marking ε_bin as an energy extremum, where M is an empirical constant with 5 < M < 20;
23) letting u be the number of energy extrema found, and recording, in order of increasing frequency, the subscript positions of the ε_bin satisfying the extremum conditions of step 22), denoted I_k, where k = 1, 2, ..., u.
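Substeps 21)–23) amount to a thresholded local-maximum search; a minimal sketch follows, where M = 10 is an arbitrary value inside the claimed range 5 < M < 20 and band numbers are 1-based as in the claim:

```python
def find_energy_extrema(energies, M=10):
    """Return the 1-based positions I_k of band energies that exceed
    both neighbours and M times the minimum band energy (substeps
    21-23). `energies` lists eps_1..eps_{N/2} in increasing
    frequency order."""
    eps_min = min(energies)
    positions = []
    # 0-based index i corresponds to 1-based band number i + 1, so
    # this loop covers band numbers 2 .. N/2 - 1
    for i in range(1, len(energies) - 1):
        if (energies[i] > energies[i - 1]
                and energies[i] > energies[i + 1]
                and energies[i] > M * eps_min):
            positions.append(i + 1)
    return positions
```

The M·ε_min floor is what rejects the small ripples of broadband noise while keeping genuine harmonic peaks.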
3. The voiced sound detection method based on harmonic characteristics according to claim 1, characterized in that said step 3) comprises the following substeps:
31) traversing all possible fundamental frequencies pitch_α, whose range is [60/R] ≤ pitch_α ≤ [450/R], where [·] denotes the greatest integer not exceeding the enclosed value, and R is the width of each frequency band, R = sampling rate / N;
32) for the currently tested fundamental frequency pitch_α: if, among the existing extreme points, 5 points spaced by pitch_α are found, or 4 points spaced by pitch_α are found of which one lies at pitch_α itself, preliminarily determining that the current frame of signal may be voiced with fundamental frequency pitch_α; if the normalized variance of the spacings between these points is less than a threshold Var_thd, considering the spacing uniform, and taking the mean spacing of the most uniformly spaced extreme points found as the accurate fundamental frequency;
33) after traversing all possible fundamental frequencies, recording each accurate fundamental frequency that may exist for the current frame of signal together with its corresponding normalized variance.
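Substep 32) can be sketched as below. The matching tolerance `tol` and the normalized-variance threshold `var_thd` are illustrative values the claim leaves open, and for simplicity this sketch measures the variance of spacings between consecutive matched points only:

```python
def match_pitch(extrema, pitch, tol=0.5, var_thd=0.02):
    """Test one candidate fundamental `pitch` (in bands) against the
    extremum positions: require 5 points spaced by the pitch, or 4
    points of which one lies at the pitch itself; if the normalized
    variance of the spacings is below var_thd, return the mean
    spacing as the accurate fundamental frequency, else None."""
    hits = []  # nearest extremum for each harmonic multiple found
    for m in range(1, 6):
        near = [p for p in extrema if abs(p - m * pitch) <= tol]
        if near:
            hits.append(min(near, key=lambda p: abs(p - m * pitch)))
    has_fundamental = any(abs(p - pitch) <= tol for p in hits)
    if not (len(hits) >= 5 or (len(hits) >= 4 and has_fundamental)):
        return None
    # spacings between consecutive matched points
    gaps = [b - a for a, b in zip(hits, hits[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    if var / (mean * mean) >= var_thd:  # normalized variance
        return None
    return mean
```

For a clean harmonic series such as extrema at bands 10, 20, 30, 40, 50, the candidate pitch 10 is accepted and the mean spacing 10.0 is returned; a candidate that matches fewer than 4 multiples is rejected.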
4. The voiced sound detection method based on harmonic characteristics according to claim 3, characterized in that step 4) is entered after said step 33) finishes, said step 4) comprising the following substeps:
41) when two accurate fundamental frequencies satisfying the condition of step 33) differ by less than D_pitch, recording only the one with the smaller normalized variance, together with its normalized variance; where D_pitch = D_min / R, D_min being an empirical value in Hz with range 50 < D_min < 150;
42) among all pitch_α in the range [60/R] ≤ pitch_α ≤ [450/R], if α = α_1, α_2, α_3, ... each correspond to uniformly distributed extrema and are mutually separated by no less than the threshold D_pitch of step 41), forming the corresponding accurate fundamental frequencies into a set.
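Substep 41)'s merging rule, sketched on (pitch, normalized variance) pairs; the greedy left-to-right pass over sorted candidates is an editorial simplification, not the claimed procedure itself:

```python
def merge_close_pitches(candidates, d_pitch):
    """Among accurate fundamental frequencies closer together than
    D_pitch (in bands), keep only the one with the smaller
    normalized variance (substep 41). `candidates` is a list of
    (pitch, norm_var) pairs; the survivors form the set of
    substep 42)."""
    kept = []
    for pitch, var in sorted(candidates):
        if kept and pitch - kept[-1][0] < d_pitch:
            # too close to the last kept pitch: keep whichever of
            # the two has the smaller normalized variance
            if var < kept[-1][1]:
                kept[-1] = (pitch, var)
        else:
            kept.append((pitch, var))
    return kept
```

Favouring the smaller normalized variance keeps the candidate whose harmonic spacing was most uniform, i.e. the more reliable pitch estimate.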
5. The voiced sound detection method based on harmonic characteristics according to claim 4, characterized in that step 5) is entered after said step 4) finishes, said step 5) comprising the following substeps:
51) within 4 consecutive frames of signal, if for every pair of adjacent frames the sets formed by their accurate fundamental frequencies each contain at least one element that is very close to an element of the other frame's set, considering voiced sound to be present in these 4 frames of signal; the criterion for two elements being close is that the ratio of their difference to the smaller of the two values does not exceed a proportionality constant ratio, where ratio ranges from 10% to 20%;
52) after the preliminary decision of step 51), judging as voiced any non-voiced signal segment in the buffer that lies between two voiced segments and is shorter than 20 milliseconds.
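The adjacency test of substep 51) reduces to a relative-difference comparison between pitch sets; ratio = 15% here is one value inside the claimed 10%~20% range:

```python
def frames_share_pitch(set_a, set_b, ratio=0.15):
    """Two adjacent frames 'share' a fundamental if some element of
    each set is close to an element of the other: the difference
    over the smaller of the two values is at most `ratio`."""
    return any(abs(a - b) / min(a, b) <= ratio
               for a in set_a for b in set_b)

def four_frames_voiced(pitch_sets, ratio=0.15):
    """Judge 4 consecutive frames voiced when every adjacent pair of
    frames shares a close pitch (substep 51)."""
    return all(frames_share_pitch(p, q, ratio)
               for p, q in zip(pitch_sets, pitch_sets[1:]))
```

The relative (rather than absolute) criterion means a 4-band wobble is acceptable around band 100 but not around band 20, matching the claim's proportional definition of closeness.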
6. The voiced sound detection method based on harmonic characteristics according to claim 5, characterized in that step 6) is entered after said step 5) finishes, said step 6) comprising the following substeps:
61) if all signals have been detected and there is no new input signal, outputting the final result;
62) if signal remains to be detected, shifting the signals of buffer frames 2 through L forward by one frame each, moving the corresponding per-frame computation results forward as well; loading a new frame of digital speech signal into buffer frame L, returning to step 1), and continuing harmonic detection on the signal in the current buffer.
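The buffer update of substep 62) is a one-frame shift; a minimal sketch on plain lists (the real buffer would hold sample frames and per-frame results in whatever form the implementation uses):

```python
def advance_buffer(frames, results, new_frame, new_result):
    """Shift the L-frame buffer forward by one frame (substep 62):
    the decision for frame 1 is returned for output, frames 2..L
    and their per-frame results move forward one slot, and the new
    frame enters slot L."""
    emitted = results[0]  # decision of the outgoing first frame
    frames[:] = frames[1:] + [new_frame]
    results[:] = results[1:] + [new_result]
    return emitted
```

Each call therefore emits exactly one finalized decision while keeping a constant 200~400 ms of context in the buffer for the cross-frame judgments of step 5).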
CN200510089956A 2005-08-08 2005-08-08 Voiced sound detection method based on harmonic characteristic Expired - Fee Related CN100580768C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200510089956A CN100580768C (en) 2005-08-08 2005-08-08 Voiced sound detection method based on harmonic characteristic


Publications (2)

Publication Number Publication Date
CN1912992A true CN1912992A (en) 2007-02-14
CN100580768C CN100580768C (en) 2010-01-13

Family

ID=37721903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200510089956A Expired - Fee Related CN100580768C (en) 2005-08-08 2005-08-08 Voiced sound detection method based on harmonic characteristic

Country Status (1)

Country Link
CN (1) CN100580768C (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102655000A (en) * 2011-03-04 2012-09-05 华为技术有限公司 Method and device for classifying unvoiced sound and voiced sound
CN103310800A (en) * 2012-03-06 2013-09-18 中国科学院声学研究所 Voiced speech detection method and voiced speech detection system for preventing noise interference
CN103886871A (en) * 2014-01-28 2014-06-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104700843A (en) * 2015-02-05 2015-06-10 海信集团有限公司 Method and device for identifying ages
CN104934032A (en) * 2014-03-17 2015-09-23 华为技术有限公司 Method and device for voice signal processing according to frequency domain energy
CN105067101A (en) * 2015-08-05 2015-11-18 北方工业大学 Fundamental tone frequency characteristic extraction method based on vibration signal for vibration source identification
CN106356076A (en) * 2016-09-09 2017-01-25 北京百度网讯科技有限公司 Method and device for detecting voice activity on basis of artificial intelligence
CN106680585A (en) * 2017-01-05 2017-05-17 攀枝花学院 Detection method of harmonics/inter-harmonics
CN107833581A (en) * 2017-10-20 2018-03-23 广州酷狗计算机科技有限公司 A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
CN110111811A (en) * 2019-04-18 2019-08-09 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110739006A (en) * 2019-10-16 2020-01-31 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN111061909A (en) * 2019-11-22 2020-04-24 腾讯音乐娱乐科技(深圳)有限公司 Method and device for classifying accompaniment
CN112885380A (en) * 2021-01-26 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and medium for detecting unvoiced and voiced sounds

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1151490C (en) * 2000-09-13 2004-05-26 中国科学院自动化研究所 High-accuracy high-resolution base frequency extracting method for speech recognization

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102655000B (en) * 2011-03-04 2014-02-19 华为技术有限公司 Method and device for classifying unvoiced sound and voiced sound
CN102655000A (en) * 2011-03-04 2012-09-05 华为技术有限公司 Method and device for classifying unvoiced sound and voiced sound
CN103310800A (en) * 2012-03-06 2013-09-18 中国科学院声学研究所 Voiced speech detection method and voiced speech detection system for preventing noise interference
CN103310800B (en) * 2012-03-06 2015-10-07 中国科学院声学研究所 A kind of turbid speech detection method of anti-noise jamming and system
CN103886871A (en) * 2014-01-28 2014-06-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN103886871B (en) * 2014-01-28 2017-01-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN104934032B (en) * 2014-03-17 2019-04-05 华为技术有限公司 The method and apparatus that voice signal is handled according to frequency domain energy
CN104934032A (en) * 2014-03-17 2015-09-23 华为技术有限公司 Method and device for voice signal processing according to frequency domain energy
CN104700843A (en) * 2015-02-05 2015-06-10 海信集团有限公司 Method and device for identifying ages
CN105067101A (en) * 2015-08-05 2015-11-18 北方工业大学 Fundamental tone frequency characteristic extraction method based on vibration signal for vibration source identification
CN106356076B (en) * 2016-09-09 2019-11-05 北京百度网讯科技有限公司 Voice activity detector method and apparatus based on artificial intelligence
CN106356076A (en) * 2016-09-09 2017-01-25 北京百度网讯科技有限公司 Method and device for detecting voice activity on basis of artificial intelligence
CN106680585B (en) * 2017-01-05 2019-01-29 攀枝花学院 Harmonic wave/m-Acetyl chlorophosphonazo detection method
CN106680585A (en) * 2017-01-05 2017-05-17 攀枝花学院 Detection method of harmonics/inter-harmonics
CN107833581A (en) * 2017-10-20 2018-03-23 广州酷狗计算机科技有限公司 A kind of method, apparatus and readable storage medium storing program for executing of the fundamental frequency for extracting sound
CN107833581B (en) * 2017-10-20 2021-04-13 广州酷狗计算机科技有限公司 Method, device and readable storage medium for extracting fundamental tone frequency of sound
CN110111811A (en) * 2019-04-18 2019-08-09 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110111811B (en) * 2019-04-18 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Audio signal detection method, device and storage medium
CN110739006A (en) * 2019-10-16 2020-01-31 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN111061909A (en) * 2019-11-22 2020-04-24 腾讯音乐娱乐科技(深圳)有限公司 Method and device for classifying accompaniment
CN111061909B (en) * 2019-11-22 2023-11-28 腾讯音乐娱乐科技(深圳)有限公司 Accompaniment classification method and accompaniment classification device
CN112885380A (en) * 2021-01-26 2021-06-01 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and medium for detecting unvoiced and voiced sounds

Also Published As

Publication number Publication date
CN100580768C (en) 2010-01-13

Similar Documents

Publication Publication Date Title
CN1912992A (en) Voiced sound detection method based on harmonic characteristic
CN1912993A (en) Voice end detection method based on energy and harmonic
CN106409310B (en) A kind of audio signal classification method and apparatus
CN101872616B (en) Endpoint detection method and system using same
CN102819009B (en) Driver sound localization system and method for automobile
CN102356427B (en) Noise suppression device
CN102044246B (en) Method and device for detecting audio signal
CN108899052B (en) Parkinson speech enhancement method based on multi-band spectral subtraction
CN104599677B (en) Transient noise suppressing method based on speech reconstructing
CN103646649A (en) High-efficiency voice detecting method
CN1335980A (en) Wide band speech synthesis by means of a mapping matrix
CN101145345B (en) Audio frequency classification method
CN101996640B (en) Frequency band expansion method and device
CN106650576A (en) Mining equipment health state judgment method based on noise characteristic statistic
CN102097095A (en) Speech endpoint detecting method and device
CN102592589B (en) Speech scoring method and device implemented through dynamically normalizing digital characteristics
CN103117067A (en) Voice endpoint detection method under low signal-to-noise ratio
CN113724712B (en) Bird sound identification method based on multi-feature fusion and combination model
Ealey et al. Harmonic tunnelling: tracking non-stationary noises during speech.
CN100541609C (en) A kind of method and apparatus of realizing open-loop pitch search
Hansen et al. Iterative speech enhancement with spectral constraints
Wu et al. A pitch-based method for the estimation of short reverberation time
Ouzounov A robust feature for speech detection
Sorin et al. The ETSI extended distributed speech recognition (DSR) standards: client side processing and tonal language recognition evaluation
Vahatalo et al. Voice activity detection for GSM adaptive multi-rate codec

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100113

CF01 Termination of patent right due to non-payment of annual fee