CN1151490C - High-accuracy high-resolution base frequency extracting method for speech recognization - Google Patents

High-accuracy high-resolution base frequency extracting method for speech recognization

Info

Publication number
CN1151490C
CN1151490C CNB001247115A CN00124711A
Authority
CN
China
Prior art keywords
fundamental frequency
frequency
score
per
fundamental
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB001247115A
Other languages
Chinese (zh)
Other versions
CN1342968A (en)
Inventor
徐波
张健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CNB001247115A priority Critical patent/CN1151490C/en
Publication of CN1342968A publication Critical patent/CN1342968A/en
Application granted granted Critical
Publication of CN1151490C publication Critical patent/CN1151490C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Abstract

The present invention relates to a high-accuracy, high-resolution fundamental frequency extraction method for speech recognition that combines frequency-domain analysis, time-domain analysis, and dynamic programming (DP). The speech signal is first transformed by FFT in the frequency domain for harmonic analysis, and several fundamental frequency candidates are selected by peak detection. The candidates are then evaluated in the time domain with autocorrelation coefficients. Finally, a dynamic programming algorithm with variable-length backtracking combines the frequency-domain and time-domain analysis results with the fundamental frequency variation to determine an optimal fundamental frequency contour. To guarantee the resolution of the extracted fundamental frequency, the method employs downsampling, interpolation, and related measures.

Description

High-accuracy, high-resolution fundamental frequency extraction method for speech recognition
Technical field
The present invention provides a fundamental frequency extraction method that combines frequency-domain analysis, time-domain analysis, and dynamic programming (DP), and belongs to the fields of signal processing and speech recognition.
Background technology
At present, extracting a reliable fundamental frequency contour under continuous-speech conditions is the basis of Chinese tone recognition and tone modeling. The main difficulty of fundamental frequency extraction is that, under the influence of factors such as vocal-tract formants and noise, values at multiples and submultiples of the correct fundamental frequency sometimes show more prominent features than the fundamental frequency itself (for example, larger peaks). Traditional methods usually perform fundamental frequency extraction in the time domain or the frequency domain alone, and at every time point extract only a single fundamental frequency value as the basis for subsequent processing such as median smoothing. Using only a frequency-domain or only a time-domain method, and extracting the fundamental frequency by a single hard criterion (for example, taking only the largest peak), frequently produces errors and yields a fundamental frequency track with obvious discontinuous jumps, which clearly contradicts the facts; for example, measurements published by J. Sundberg in 1979 in the Journal of Phonetics show that the rate of change of the fundamental frequency is on the order of 1%/ms. Obvious discontinuities in the fundamental frequency can be partly remedied by a traditional DP (dynamic programming) method, but a traditional DP algorithm must process an entire segment of speech before it can output the fundamental frequency track; such a delay is very hard to accept in a real-time speech recognition system, and other applications of DP algorithms generally address this problem with fixed-length backtracking.
The main difficulties of fundamental frequency extraction are:
1. In the frequency domain, the spectrum of the speech signal obtained through the FFT (fast Fourier transform) often has large peaks at F0/2, F0, 2F0, 3F0, ... (where F0 denotes the correct fundamental frequency). As shown in Fig. 2, the spectrum of a voiced frame after the FFT has clear peaks at F0, 2F0, and 3F0, but the peak at the fundamental frequency F0 is noticeably lower than the one at 2F0. Some algorithms simply take the frequency of the largest peak as the fundamental frequency, or extract several peaks and average the spacing between them to estimate the fundamental frequency. However, some peaks may not be prominent, and the phenomenon of a "missing fundamental" can even occur, which causes great difficulty for traditional pitch detection methods based on spectral analysis.
2. In the time domain, a representative fundamental frequency extraction method is the short-time autocorrelation function method. The short-time autocorrelation function is computed as:
R[m] = (1/(N−m)) · Σ_{i=0}^{N−m−1} x_i·x_{i+m}    (1)
where x_i denotes the i-th sample in the frame, N is the total number of samples per frame, and R[m] is the autocorrelation value at a lag of m samples. By the properties of the autocorrelation function, the autocorrelation of a periodic signal is also periodic, with the same period as the speech signal; in general, if T0 denotes the pitch period (T0 = 1/F0), then apart from R[0] the largest peak occurs at T0. Fig. 4 shows the autocorrelation curve of a voiced frame: there are large peaks at T0, 2T0, and 3T0, but the height of the peak at T0 is very close to those at 2T0 and 3T0. Detecting the largest peak of the autocorrelation function can therefore serve as a way of estimating the fundamental frequency, but, similarly to the frequency-domain case, large peaks usually appear near T0/2, 2T0, 3T0, ..., and the largest peak sometimes falls on one of them. This limits the reliability of the algorithm.
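For illustration, equation (1) and the normalized coefficient R_per used later can be computed with the following minimal NumPy sketch (the function and variable names, such as frame and max_lag, are illustrative and not part of the patent text):

```python
import numpy as np

def short_time_autocorr(frame, max_lag):
    """Short-time autocorrelation R[m] of one frame, per equation (1)."""
    N = len(frame)
    R = np.zeros(max_lag + 1)
    for m in range(max_lag + 1):
        # R[m] = 1/(N-m) * sum_{i=0}^{N-m-1} x_i * x_{i+m}
        R[m] = np.dot(frame[:N - m], frame[m:]) / (N - m)
    return R

def autocorr_coefficient(R, lag):
    """Normalized autocorrelation coefficient R_per = R[lag] / R[0]."""
    return R[lag] / R[0]
```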
In addition, resolution and computational cost also pose difficulties for real-time fundamental frequency extraction.
For example, in methods based on FFT spectral analysis, the resolution of the extracted fundamental frequency is proportional to the number of FFT points, while the computation grows rapidly with that number, so an FFT with too many points cannot be used. Computing the autocorrelation and searching for its maximum over the whole effective frequency range (the fundamental frequency range of human speech) requires a large number of multiplications and is very time consuming, which is unsuitable for a speech recognition system that must run in real time. The number of candidate values also affects the decision speed of the DP algorithm: if every fundamental frequency in the effective range were taken as a candidate, the decision burden would inevitably increase greatly.
Summary of the invention
The object of the present invention is to propose a new fundamental frequency extraction method. Its basic idea is: first apply an FFT to the speech signal and perform harmonic analysis in the frequency domain; then select several fundamental frequency candidates by peak detection; evaluate these candidates in the time domain with autocorrelation coefficients; and finally determine an optimal fundamental frequency track with a dynamic programming algorithm.
The technical essentials of the present invention are: a high-accuracy, high-resolution fundamental frequency extraction method for speech recognition that combines time-domain and frequency-domain analysis. Several possible fundamental frequency candidates are extracted, and a dynamic programming search is performed with an objective function that combines the autocorrelation coefficient, the spectral analysis result, and the fundamental frequency variation. Before the autocorrelation is computed, the spectral analysis result already used by speech recognition is exploited: an FFT-based harmonic analysis gives a preliminary estimate of the fundamental frequency range of the speech, which narrows the range over which the autocorrelation must be computed without affecting the precision of the extracted fundamental frequency. The harmonic analysis comprises steps such as downsampling, pre-emphasis and windowing, FFT computation, and quadratic spline interpolation to improve resolution.
1. Harmonic analysis is applied to the spectrum obtained by the FFT: for each fundamental frequency in the effective range, the accumulated sum of several harmonics is computed. Let x(i) denote the speech signal of one frame, X(f) its FFT, and S(f) the resulting power spectrum, that is:
S(f) = |X(f)|²    (2)
The harmonic accumulation result is then:
H(f) = Σ_{n=1}^{HN} h_n·S(nf)    (3)
where HN is the number of harmonics and h_n is the weight of the n-th harmonic. In this method the power spectrum is not passed through a logarithm (Log) operation, nor is the linear frequency axis mapped to a logarithmic axis (i.e. Log(F0) is not computed), which reduces the computation to some extent. The largest peaks of the harmonic analysis result H(f) are then selected by a linear search (let P_1, P_2, ..., P_PN denote these peak values), and the corresponding frequencies are taken as fundamental frequency candidates. At the same time, the size of each peak relative to the largest peak P_max is computed as the scoring parameter H_per of the subsequent DP algorithm, that is:
H_per(i) = P_i / P_max,  i = 1, ..., PN    (4)
An example of fundamental frequency peak detection by harmonic analysis is shown in Fig. 3, which gives the harmonic accumulation curve of a voiced frame. After the harmonic computation there is a clear maximum peak at F0; this peak, together with several other peaks (marked with circles), is taken as the set of fundamental frequency candidates.
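For illustration, the harmonic accumulation of equation (3) and the relative peak height H_per of equation (4) might be sketched as follows, assuming the power spectrum is sampled on a uniform frequency grid; the helper names, the default of five harmonics, and the simple local-maximum peak picker are illustrative assumptions:

```python
import numpy as np

def harmonic_sum(power_spec, freq_step, f0_range, HN=5, weights=None):
    """H(f) = sum_{n=1..HN} h_n * S(n*f) over candidate fundamentals, per equation (3)."""
    if weights is None:
        weights = np.ones(HN)                      # h_n: harmonic weights
    lo, hi = f0_range
    cand_bins = np.arange(int(lo / freq_step), int(hi / freq_step) + 1)
    H = np.zeros(len(cand_bins))
    for k, b in enumerate(cand_bins):
        for n in range(1, HN + 1):
            if n * b < len(power_spec):
                H[k] += weights[n - 1] * power_spec[n * b]
    return cand_bins * freq_step, H

def pick_candidates(freqs, H, num_peaks=5):
    """Select the largest local maxima of H(f); return (f0, H_per) pairs, per equation (4)."""
    peaks = [i for i in range(1, len(H) - 1) if H[i] > H[i - 1] and H[i] >= H[i + 1]]
    peaks.sort(key=lambda i: H[i], reverse=True)
    peaks = peaks[:num_peaks]
    if not peaks:
        return []
    p_max = max(H[i] for i in peaks)
    return [(freqs[i], H[i] / p_max) for i in peaks]
```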
2. In the time domain, the autocorrelation values {R[i]} of the fundamental frequency candidates obtained in the frequency domain are computed and normalized by R[0] to give the autocorrelation coefficient R_per(i) = R[i]/R[0]. As noted above, the peak of the autocorrelation coefficient of a voiced frame is relatively large and close to 1, as shown in Fig. 4, whereas the autocorrelation peaks of an unvoiced frame, which has no fundamental frequency, are all very low, as shown in Fig. 5. The autocorrelation coefficient can therefore serve as a criterion for voiced/unvoiced discrimination and can filter out clearly unvoiced frames. Using the autocorrelation coefficient to score and screen the limited set of candidates obtained in the frequency domain both reduces the amount of autocorrelation computation and the load of the subsequent DP algorithm, and at the same time allows the DP algorithm to combine the frequency-domain and time-domain analysis results when selecting the fundamental frequency, which is more reasonable.
3. Dynamic programming is a widely used optimization algorithm. It decomposes a multi-stage decision process into a number of single-stage decision sub-processes, which simplifies the computation. At each decision step, all possible paths are scored by a scoring function, the paths are pruned according to their scores, and the path with the best (largest or smallest) score is finally output. In this DP algorithm, the fundamental frequency candidates of one frame are processed at a time: the existing fundamental frequency paths are extended, each extended path is scored, and only the highest-scoring paths are kept. The score formula is:
Score(i) = max_j { Score(i−1) − D(i,j) } + a·R_per(i) + b·H_per(i)    (5)
where Score(i) is the score of the fundamental frequency path at frame i, a and b are the weighting coefficients of R_per(i) and H_per(i), and D(i,j) is the distance between the fundamental frequency p_i of frame i and the j-th fundamental frequency candidate p_j of frame i−1:
D(i,j) = 2·|p_i − p_j| / (p_i + p_j)
The scoring function of this method takes three factors into account: the continuity of the fundamental frequency variation, maximizing the accumulated autocorrelation coefficient ΣR_per(i), and maximizing the accumulated relative harmonic peak ΣH_per(i).
All three are important evidence for fundamental frequency discrimination; the latter two factors were discussed above.
The continuity term scores the possible fundamental frequency paths according to the principle that the fundamental frequency track of human speech cannot change abruptly.
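For illustration, the per-frame score update of equation (5), with pruning to the highest-scoring paths, might be sketched as follows; the path representation and the default weights a, b and beam width are illustrative assumptions:

```python
def dp_step(paths, candidates, a=1.0, b=1.0, beam=5):
    """Extend each kept path by this frame's candidates and score it with equation (5).

    paths:      list of (score, [f0, ...]) kept from the previous frame
    candidates: list of (f0, R_per, H_per) for the current frame
    """
    new_paths = []
    for f0, r_per, h_per in candidates:
        if not paths:                              # first voiced frame of a run: no history term
            new_paths.append((a * r_per + b * h_per, [f0]))
            continue
        # max over previous paths of Score(i-1) - D(i,j), with D(i,j) = 2*|p_i - p_j|/(p_i + p_j)
        best_prev, best_hist = max(
            ((score - 2.0 * abs(f0 - hist[-1]) / (f0 + hist[-1]), hist)
             for score, hist in paths),
            key=lambda t: t[0],
        )
        new_paths.append((best_prev + a * r_per + b * h_per, best_hist + [f0]))
    # keep only the highest-scoring paths (pruning)
    new_paths.sort(key=lambda t: t[0], reverse=True)
    return new_paths[:beam]
```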
Because unvoiced speech has no clear fundamental frequency, this algorithm adopts variable-length backtracking. Concretely, the DP algorithm starts at the first frame of a run of consecutive voiced frames and ends at the last frame of that run, performing local, dynamically variable-length backtracking. Fundamental frequencies proposed for unvoiced frames are meaningless, and the continuity of the fundamental frequency cannot be carried across them into the next voiced segment; if dynamic programming were performed globally, too many interfering scores would inevitably be added, the fundamental frequency track would drift randomly, and the effect of the DP algorithm would sometimes be seriously degraded. In this method, runs of voiced frames are identified from the zero-crossing-rate detection, the autocorrelation coefficient, and the accumulated relative harmonic peak, with thresholds chosen statistically for good discrimination; frames with no fundamental frequency candidate surviving the threshold filtering are judged unvoiced and do not take part in the DP search for the fundamental frequency track. Removing most of the unvoiced frames, which have no fundamental frequency, greatly reduces their influence on the DP-based extraction and reduces errors; at the same time it preserves the continuity of the fundamental frequency track and significantly reduces the delay of the DP algorithm. The few unvoiced frames that are included do not have a large influence on the DP algorithm, and the few voiced frames that are excluded can be well compensated in the connection step.
4. Other measures.
(a) Zero-crossing rate detection. As mentioned above, to improve the discrimination between voiced and unvoiced speech and to speed up fundamental frequency extraction, zero-crossing rate detection is first applied to the speech signal. A relatively large threshold is set in advance; frames whose zero-crossing rate exceeds this value are judged unvoiced and no fundamental frequency extraction is performed on them; otherwise the fundamental frequency is extracted with the FFT/DP algorithm described below. The threshold is set relatively high to guarantee that voiced frames are not filtered out. To reduce the influence of waveform offset, the reference level for zero crossings is not zero but the mean of the frame data.
(b) Downsampling and interpolation. To improve the precision of fundamental frequency extraction, the speech signal is downsampled before the FFT, and the spectrum obtained from the FFT is interpolated. If the original sampling rate is SR, the sampling rate is reduced to 1/RDC of the original, an FFTLen-point FFT is performed, and the resulting power spectrum is interpolated with Inpl_N points (that is, Inpl_N−1 points are inserted between every pair of adjacent FFT points), then the resolution of the extracted fundamental frequency is:
SR / (RDC · FFTLen · Inpl_N)    (7)
In the present embodiment, the original signal sampling rate SR = 16000 Hz is reduced to 1/4, that is, RDC = 4; the FFT length is FFTLen = 512; and the interpolation factor is Inpl_N = 20. The extraction resolution is then 0.39 Hz. By comparison, the autocorrelation-based method with 384 samples per frame has an extraction resolution of 16000/N² (where N is the number of samples corresponding to one signal period), i.e. roughly from 15 Hz down to 0.4 Hz over the fundamental frequency range, so the 0.39 Hz resolution of the proposed method is clearly sufficient over the whole fundamental frequency range.
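For illustration, measures (a) and (b) might be sketched as follows with NumPy/SciPy, using the embodiment's values SR = 16000 Hz, RDC = 4, FFTLen = 512 and Inpl_N = 20; the zero-crossing threshold and the pre-emphasis coefficient are illustrative assumptions:

```python
import numpy as np
from scipy.signal import decimate
from scipy.interpolate import interp1d

def is_unvoiced_by_zcr(frame, zcr_threshold=0.4):
    """Measure (a): count crossings of the frame mean; high rates are judged unvoiced."""
    ref = frame - np.mean(frame)                   # reference level is the frame mean, not zero
    crossings = np.sum(ref[:-1] * ref[1:] < 0)
    return crossings / len(frame) > zcr_threshold

def interpolated_power_spectrum(frame, sr=16000, rdc=4, fft_len=512, inpl_n=20):
    """Measure (b): downsample, pre-emphasize, window, FFT, then quadratic-spline interpolation.

    The resulting frequency resolution is sr / (rdc * fft_len * inpl_n), equation (7);
    with the embodiment's values this is 16000 / (4 * 512 * 20), roughly 0.39 Hz.
    """
    x = decimate(frame, rdc)                        # downsample to sr / rdc
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])      # pre-emphasis (coefficient assumed)
    x = x * np.hamming(len(x))                      # Hamming window
    spec = np.abs(np.fft.rfft(x, fft_len)) ** 2     # power spectrum, equation (2)
    bins = np.arange(len(spec))
    fine_bins = np.arange(0, len(spec) - 1, 1.0 / inpl_n)
    fine_spec = interp1d(bins, spec, kind='quadratic')(fine_bins)  # quadratic spline interpolation
    freq_step = (sr / rdc) / (fft_len * inpl_n)     # roughly 0.39 Hz per interpolated point
    return fine_spec, freq_step
```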
The advantages of the invention are as follows. Combining the time-domain and frequency-domain fundamental frequency extraction methods makes them effectively complementary: in the frequency domain the fundamental frequency candidates are determined from the harmonic sums, with no manually set thresholds in between, which reduces the uncertainty of the computation. The dynamic programming objective simultaneously takes into account the time-domain autocorrelation and the frequency-domain harmonic sum, two indicators closely related to the fundamental frequency, and maximizes them, which guarantees the continuity and reliability of the fundamental frequency. Variable-length backtracking reduces the theoretical delay of the speech recognition system, which is extremely important for applications with strong real-time requirements such as telephone speech recognition.
To verify the effectiveness of the present invention, 725 voiced frames were extracted from two difficult utterances with strong channel interference, for which fundamental frequency extraction is hard; obviously discontinuous fundamental frequencies, judged manually, were counted as recognition errors. Before the fundamental frequency candidates of the present invention were fed into the DP algorithm the error rate was 11.7%; after they were adopted it was 0.4%. The recognition accuracy is thus greatly improved.
In view of the foregoing description, the complete fundamental frequency extraction flow, shown in Fig. 6, is as follows:
(1) Signal segmentation: the input speech signal is first split into frames, with a certain overlap between adjacent frames; each frame is then processed as follows.
(2) Zero-crossing rate detection: the average number of zero crossings is computed as a rough voiced/unvoiced estimate; a frame whose zero-crossing count exceeds a given threshold is judged unvoiced and no fundamental frequency is extracted.
(3) Downsampling: this both improves the fundamental frequency extraction resolution and guarantees that no frequency content below 1250 Hz that is significant for fundamental frequency extraction is lost.
(4) Pre-emphasis and windowing: the purpose of this step is to reduce frequency aliasing. A Hamming window is used, with the formula:
h(n) = 0.54 − 0.46·cos(2πn/(N−1)),  0 ≤ n ≤ N−1
where h(n) is the window function and N is the window length.
(5) FFT and power spectrum: a 512-point FFT (fast Fourier transform) is used, and the power spectrum is computed with formula (2).
(6) Interpolation: to improve the precision of fundamental frequency extraction, the spectrum is interpolated by a factor of Inpl_N = 20 between every two FFT points, using quadratic spline interpolation.
(7) Harmonic accumulation: the harmonic sum is computed with formula (3).
(8) Fundamental frequency candidates: several peaks are selected from the harmonic accumulation spectrum; the frequencies of the peak points are taken as candidates, together with the height of each peak relative to the largest peak, i.e. H_per.
(9) For each fundamental frequency candidate, the corresponding autocorrelation coefficient R_per is computed in the time domain; candidates with low H_per or R_per values are filtered out to reduce the DP workload.
(10) The fundamental frequency path is found with the DP algorithm: the score of each path is computed with formula (5), and the several best-scoring paths are recorded.
(11) If a run of consecutive voiced frames has been completed, the optimal fundamental frequency path according to the DP score is output as the extraction result; otherwise the procedure returns to step (2) to perform zero-crossing detection on the next frame. After the whole input signal has been processed, normalization and connection follow: the processed speech is normalized by its average fundamental frequency, which eliminates speaker differences, and connection smoothly joins the consonant parts, which have no fundamental frequency, with the vowel parts, which do.
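Tying the sketches above together, a hypothetical per-frame driver for steps (1)-(11) might look as follows, reusing the illustrative helpers defined earlier; the framing parameters, the fundamental frequency search range, the filtering thresholds, and the assumption that the input is a NumPy array are not prescribed by the patent:

```python
def extract_f0(signal, sr=16000, frame_len=384, hop=160):
    """Sketch of the overall flow (1)-(11), reusing the helper sketches above."""
    contour, paths = [], []

    def flush_run():
        # End of a voiced run: variable-length backtracking outputs the best path.
        nonlocal paths
        if paths:
            contour.extend(max(paths, key=lambda t: t[0])[1])
            paths = []

    for start in range(0, len(signal) - frame_len, hop):           # (1) framing
        frame = signal[start:start + frame_len].astype(float)
        cands = []
        if not is_unvoiced_by_zcr(frame):                          # (2) zero-crossing pre-filter
            spec, step = interpolated_power_spectrum(frame, sr)    # (3)-(6)
            freqs, H = harmonic_sum(spec, step, f0_range=(60.0, 450.0))  # (7)
            R = short_time_autocorr(frame, max_lag=frame_len // 2)
            for f0, h_per in pick_candidates(freqs, H):            # (8) frequency-domain candidates
                lag = min(int(round(sr / f0)), len(R) - 1)         # lag clamped to analysed range
                r_per = autocorr_coefficient(R, lag)               # (9) time-domain score
                if r_per > 0.3 and h_per > 0.3:                    # filtering thresholds assumed
                    cands.append((f0, r_per, h_per))
        if cands:
            paths = dp_step(paths, cands)                          # (10) extend and prune paths
        else:
            flush_run()                                            # (11) end of a voiced run
            contour.append(0.0)                                    # unvoiced frame: no fundamental
    flush_run()
    return contour
```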
Description of drawings
Fig. 1. Example of fundamental frequency extraction results: the left side is a schematic of the fundamental frequency tracks corresponding to the four tones in isolated speech; the right side shows, from top to bottom, the waveform of continuous speech, the extracted fundamental frequency track, and the speech content;
Fig. 2. Spectrum of a voiced frame obtained by the FFT: there are clear peaks at F0, 2F0, and 3F0, but the peak at F0 is less prominent than the one at 2F0;
Fig. 3. Harmonic accumulation curve of a voiced frame: there is a clear maximum peak at F0, and this peak together with several other peaks (marked with circles) serves as the set of fundamental frequency candidates;
Fig. 4. Autocorrelation curve of a voiced frame: there are large peaks at T0, 2T0, and 3T0, but the height of the peak at T0 is very close to those at 2T0 and 3T0;
Fig. 5. Autocorrelation curve of an unvoiced frame: apart from R[0], all peak values are low and there is no obvious periodicity;
Fig. 6. Flow chart of fundamental frequency extraction.
Embodiment
One distinguishing feature of Chinese speech is its tones. Standard Chinese tones are generally divided into the first, second, third, and fourth tones, that is, the high level, rising, falling-rising, and falling tones, plus the neutral (light) tone. Syllables composed of the same initials and finals but carrying different tones have completely different meanings and usually correspond to different characters or words (for example: rise, week, prison term, departure date, odor). Tone is therefore of great significance in Chinese speech recognition.
Tones are realized as variations of the fundamental frequency (F0) track. The left side of Fig. 1 shows the correspondence between tone and fundamental frequency for isolated speech, and the right side shows it for continuous speech. The fundamental frequency extraction method proposed by the present invention can provide a fairly accurate fundamental frequency track for speech recognition and be used for tone recognition; in particular, the extracted fundamental frequency can serve as a parameter of the speech recognition system, or be used to build tone models, so as to reflect the phonetic features more completely, guide language processing, and improve recognition accuracy.
Accurate extraction of the fundamental frequency track can also serve as important evidence for recognizing information such as syntactic structure, word stress, and speaker intent in continuous speech. The tones of Chinese are a difficult point in Chinese-language teaching, and displaying the fundamental frequency curve allows the pupil's pronunciation to be taught interactively, so the method can also be used in the teaching of standard Chinese. Although tones are affected by context, tone information is less affected by the acoustic environment and the channel, so from some perspectives tone is significant for improving the reliability of speech information processing.

Claims (1)

1. A high-accuracy, high-resolution fundamental frequency extraction method for speech recognition, characterized in that the fundamental frequency extraction comprises the following steps:
(1) signal segmentation: the input speech signal is first split into frames, with a certain overlap between adjacent frames; each frame is then processed in turn according to the following steps:
(2) zero-crossing rate detection: the average number of zero crossings is computed as a rough voiced/unvoiced estimate; a frame whose zero-crossing count exceeds a given threshold is judged unvoiced and no fundamental frequency is extracted;
(3) downsampling: the sampling rate is reduced appropriately under the premise of not losing the frequency content below 1250 Hz that is significant for fundamental frequency extraction;
(4) pre-emphasis and windowing: a Hamming window is used, with the formula
h(n) = 0.54 − 0.46·cos(2πn/(N−1)),  0 ≤ n ≤ N−1
where h(n) is the window function and N is the window length;
(5) FFT computation and power spectrum: a multi-point FFT (fast Fourier transform) is used, and the power spectrum is computed with the formula
S(f) = |X(f)|²
where X(f) is the FFT of the signal and S(f) is the power spectrum;
(6) interpolation: Inpl_N values are inserted between every two FFT points of the spectrum, using quadratic spline interpolation;
(7) harmonic accumulation: the harmonic accumulation spectrum is computed as
H(f) = Σ_{n=1}^{HN} h_n·S(nf)
where S(f) is the interpolated power spectrum, HN is the maximum number of harmonics, h_n is the weight of the n-th harmonic, and H(f) is the harmonic accumulation value corresponding to frequency f;
(8) peak detection and determination of fundamental frequency candidates: several peaks are selected from the harmonic accumulation spectrum; the frequency of each peak point is taken as a fundamental frequency candidate, together with the height of the peak relative to the largest peak, i.e. H_per;
(9) for each of the fundamental frequency candidates, the corresponding autocorrelation coefficient R_per is computed in the time domain; candidates with low H_per or R_per values are filtered out to reduce the workload of the next step, dynamic programming (DP);
(10) the fundamental frequency track is found with the dynamic programming (DP) algorithm, which computes the score of every track; the score formula is Score(i) = max{Score(i−1) − D(i,j)} + a·R_per(i) + b·H_per(i); the several best-scoring tracks are recorded; Score(i) in the score formula is the score of the fundamental frequency path at frame i, a and b are the weighting coefficients of R_per(i) and H_per(i) respectively, and D(i,j) is the distance between the fundamental frequency p_i of frame i and the j-th fundamental frequency candidate p_j of frame i−1, computed as D(i,j) = 2·|p_i − p_j| / (p_i + p_j);
(11) when a run of consecutive voiced frames has been processed, the optimal fundamental frequency track according to the DP score is output as the extraction result; otherwise the procedure returns to step (2) to perform zero-crossing detection on the next frame; after the whole input signal has been processed, normalization and connection follow: the processed speech is normalized by its average fundamental frequency, and connection smoothly joins the unvoiced parts, which have no fundamental frequency, with the voiced parts, which do.
CNB001247115A 2000-09-13 2000-09-13 High-accuracy high-resolution base frequency extracting method for speech recognization Expired - Lifetime CN1151490C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB001247115A CN1151490C (en) 2000-09-13 2000-09-13 High-accuracy high-resolution base frequency extracting method for speech recognization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB001247115A CN1151490C (en) 2000-09-13 2000-09-13 High-accuracy high-resolution base frequency extracting method for speech recognization

Publications (2)

Publication Number Publication Date
CN1342968A CN1342968A (en) 2002-04-03
CN1151490C true CN1151490C (en) 2004-05-26

Family

ID=4590609

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB001247115A Expired - Lifetime CN1151490C (en) 2000-09-13 2000-09-13 High-accuracy high-resolution base frequency extracting method for speech recognization

Country Status (1)

Country Link
CN (1) CN1151490C (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100580768C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voiced sound detection method based on harmonic characteristic
KR100762596B1 (en) * 2006-04-05 2007-10-01 삼성전자주식회사 Speech signal pre-processing system and speech signal feature information extracting method
CN1835075B (en) * 2006-04-07 2011-06-29 安徽中科大讯飞信息科技有限公司 Speech synthetizing method combined natural sample selection and acaustic parameter to build mould
CN101030375B (en) * 2007-04-13 2011-01-26 清华大学 Method for extracting base-sound period based on dynamic plan
CN101727902B (en) * 2008-10-29 2011-08-10 中国科学院自动化研究所 Method for estimating tone
CN102176313B (en) * 2009-10-10 2012-07-25 北京理工大学 Formant-frequency-based Mandarin single final vioce visualizing method
CN102163428A (en) * 2011-01-19 2011-08-24 无敌科技(西安)有限公司 Method for judging Chinese pronunciation
CN102783034B (en) * 2011-02-01 2014-12-17 华为技术有限公司 Method and apparatus for providing signal processing coefficients
CN102842305B (en) * 2011-06-22 2014-06-25 华为技术有限公司 Method and device for detecting keynote
CN103366736A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 Phonetic tone identification method and phonetic tone identification apparatus
CN104251934B (en) * 2013-06-26 2018-08-14 华为技术有限公司 Harmonic analysis method and device and the method and apparatus for determining clutter between harmonic wave
CN103680518A (en) * 2013-12-20 2014-03-26 上海电机学院 Voice gender recognition method and system based on virtual instrument technology
CN103871417A (en) * 2014-03-25 2014-06-18 北京工业大学 Specific continuous voice filtering method and device of mobile phone
CN103943104B (en) * 2014-04-15 2017-03-01 海信集团有限公司 A kind of voice messaging knows method for distinguishing and terminal unit
JP6716466B2 (en) * 2014-04-28 2020-07-01 マサチューセッツ インスティテュート オブ テクノロジー Monitoring vital signs by radio reflection
CN104200818A (en) * 2014-08-06 2014-12-10 重庆邮电大学 Pitch detection method
CN104217722B (en) * 2014-08-22 2017-07-11 哈尔滨工程大学 A kind of dolphin whistle signal time-frequency spectrum contour extraction method
CN105551501B (en) * 2016-01-22 2019-03-15 大连民族大学 Harmonic signal fundamental frequency estimation algorithm and device
CN107045875B (en) * 2016-02-03 2019-12-06 重庆工商职业学院 fundamental tone frequency detection method based on genetic algorithm
CN106205638B (en) * 2016-06-16 2019-11-08 清华大学 A kind of double-deck fundamental tone feature extracting method towards audio event detection
US10989803B1 (en) 2017-08-21 2021-04-27 Massachusetts Institute Of Technology Security protocol for motion tracking systems
CN107833581B (en) * 2017-10-20 2021-04-13 广州酷狗计算机科技有限公司 Method, device and readable storage medium for extracting fundamental tone frequency of sound
CN108447505B (en) * 2018-05-25 2019-11-05 百度在线网络技术(北京)有限公司 Audio signal zero-crossing rate processing method, device and speech recognition apparatus
CN109346109B (en) * 2018-12-05 2020-02-07 百度在线网络技术(北京)有限公司 Fundamental frequency extraction method and device
CN110379438B (en) * 2019-07-24 2020-05-12 山东省计算中心(国家超级计算济南中心) Method and system for detecting and extracting fundamental frequency of voice signal

Also Published As

Publication number Publication date
CN1342968A (en) 2002-04-03

Similar Documents

Publication Publication Date Title
CN1151490C (en) High-accuracy high-resolution base frequency extracting method for speech recognization
Drugman et al. Glottal closure and opening instant detection from speech signals
CN105825852A (en) Oral English reading test scoring method
CN108896878B (en) Partial discharge detection method based on ultrasonic waves
CN103886871B (en) Detection method of speech endpoint and device thereof
Deshmukh et al. Use of temporal information: Detection of periodicity, aperiodicity, and pitch in speech
CN101872616B (en) Endpoint detection method and system using same
CN1991976A (en) Phoneme based voice recognition method and system
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
US20030088401A1 (en) Methods and apparatus for pitch determination
CN104143324B (en) A kind of musical tone recognition method
CN1527994A (en) Fast frequency-domain pitch estimation
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN103366735B (en) The mapping method of speech data and device
CN102054480A (en) Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN108847252B (en) Acoustic feature extraction method based on acoustic signal spectrogram texture distribution
CN103366759A (en) Speech data evaluation method and speech data evaluation device
CN108682432B (en) Speech emotion recognition device
CN110299141A (en) The acoustic feature extracting method of recording replay attack detection in a kind of Application on Voiceprint Recognition
Li et al. A comparative study on physical and perceptual features for deepfake audio detection
CN114627892A (en) Deep learning-based polyphonic music and human voice melody extraction method
KR100393899B1 (en) 2-phase pitch detection method and apparatus
CN202758611U (en) Speech data evaluation device
US4982433A (en) Speech analysis method
CN111091816B (en) Data processing system and method based on voice evaluation

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20020403

Assignee: Beijing Zidong Ruiyi Voice Technology Co., Ltd.

Assignor: Institute of Automation, Chinese Academy of Sciences

Contract record no.: 2015110000014

Denomination of invention: High-accuracy high-resolution base frequency extracting method for speech recognization

Granted publication date: 20040526

License type: Common License

Record date: 20150519

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20020403

Assignee: Taro Technology (Hangzhou) Co., Ltd.

Assignor: Beijing Zidong Ruiyi Voice Technology Co., Ltd.

Contract record no.: 2015110000050

Denomination of invention: High-accuracy high-resolution base frequency extracting method for speech recognization

Granted publication date: 20040526

License type: Common License

Record date: 20151130

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
CX01 Expiry of patent term

Granted publication date: 20040526

CX01 Expiry of patent term