AU598933B2 - An adaptive threshold voiced detector - Google Patents

An adaptive threshold voiced detector Download PDF

Info

Publication number
AU598933B2
AU598933B2 AU17007/88A AU1700788A AU598933B2 AU 598933 B2 AU598933 B2 AU 598933B2 AU 17007/88 A AU17007/88 A AU 17007/88A AU 1700788 A AU1700788 A AU 1700788A AU 598933 B2 AU598933 B2 AU 598933B2
Authority
AU
Australia
Prior art keywords
value
frame
discriminant
speech
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU17007/88A
Other versions
AU1700788A (en)
Inventor
David Lynn Thomson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
American Telephone and Telegraph Co Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by American Telephone and Telegraph Co Inc filed Critical American Telephone and Telegraph Co Inc
Publication of AU1700788A publication Critical patent/AU1700788A/en
Application granted granted Critical
Publication of AU598933B2 publication Critical patent/AU598933B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
  • Interface Circuits In Exchanges (AREA)
  • Radio Relay Systems (AREA)
  • Oscillators With Electromechanical Resonators (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Apparatus for detecting a fundamental frequency in speech by statistically analyzing a discriminant variable generated by a discriminant voiced detector (102) so as to determine the presence of the fundamental frequency in a changing speech environment. A statistical calculator (103) is responsive to the discriminant variable to first calculate the average of all of the values of the discriminant variable over the present and past speech frames and then to determine the overall probability that any frame will be unvoiced. In addition, the calculator forms two values: one value represents the statistical average of the discriminant values that an unvoiced frame's discriminant variable would have, and the other value represents the statistical average of the discriminant values for voiced frames. These latter calculations are performed utilizing not only the average discriminant value but also a weight value and a threshold value which are adaptively determined by a threshold calculator (104) from frame to frame. An unvoiced/voiced determinator (105) makes the unvoiced/voiced decision by utilizing the weight and threshold values.

Description

PCT
INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(51) International Patent Classification 4: 3/00
(11) International Publication Number: WO 88/07739 A1
(43) International Publication Date: 6 October 1988 (06.10.88)
(21) International Application Number: PCT/US88/00031
(22) International Filing Date: 12 January 1988 (12.01.88)
(31) Priority Application Number: 034,298
(32) Priority Date: 3 April 1987 (03.04.87)
(33) Priority Country: US
(71) Applicant: AMERICAN TELEPHONE & TELEGRAPH COMPANY [US/US]; 550 Madison Avenue, New York, NY 10022 (US).
(72) Inventor: THOMSON, David Lynn; 29W543 Country Ridge Drive, Apartment D, Warrenville, IL 60555 (US).
(74) Agents: HIRSCH, Jr., et al.; Post Office Box 679, Holmdel, NJ 07733 (US).
(81) Designated States: AT (European patent), AU, BE (European patent), CH (European patent), DE (European patent), FR (European patent), GB (European patent), IT (European patent), JP, LU (European patent), NL (European patent), SE (European patent).

Published with international search report. Before the expiration of the time limit for amending the claims and to be republished in the event of the receipt of amendments.

A.O.J. DEC 1988
AUSTRALIAN PATENT OFFICE: 2 NOV 1988

(54) Title: AN ADAPTIVE THRESHOLD VOICED DETECTOR
AN ADAPTIVE THRESHOLD VOICED DETECTOR

Technical Field

This invention relates to determining whether or not speech contains a fundamental frequency, which is commonly referred to as the unvoiced/voiced decision. More particularly, the unvoiced/voiced decision is made by a two-stage voiced detector with the final threshold values being adaptively calculated for the speech environment utilizing statistical techniques.
Background and Problem

In low bit rate voice coders, degradation of voice quality is often due to inaccurate voicing decisions. The difficulty in correctly making these voicing decisions lies in the fact that no single speech parameter or classifier can reliably distinguish voiced speech from unvoiced speech. In order to make the voicing decision, it is known in the art to combine multiple speech classifiers in the form of a weighted sum. This method is commonly called discriminant analysis. Such a method is illustrated in D. P. Prezas, et al., "Fast and Accurate Pitch Detection Using Pattern Recognition and Adaptive Time-Domain Analysis," Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., Vol. 1, pp. 109-112, April 1986. As described in that article, a frame of speech is declared voiced if a weighted sum of classifiers is greater than a specified threshold, and unvoiced otherwise. The weights and threshold are chosen to maximize performance on a training set of speech where the voicing of each frame is known.
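The fixed weighted-sum decision described above can be sketched as follows. This is an illustrative sketch only; the weights, threshold, and classifier values below are invented and are not the trained values from the cited article.

```python
# Discriminant-analysis voicing decision: a frame is declared voiced
# when the weighted sum of its classifiers exceeds a fixed threshold.

def weighted_sum_decision(classifiers, weights, threshold):
    """Return True (voiced) if the weighted sum exceeds the threshold."""
    score = sum(w * c for w, c in zip(weights, classifiers))
    return score > threshold

# Example with invented values: one strongly periodic frame, one noise-like frame.
weights = [0.4, -0.05, 0.56, 1.36]   # illustrative only
frame_a = [10.0, 2.0, 1.5, 0.9]      # periodic-looking classifier values
frame_b = [2.0, 0.5, 0.1, 0.05]      # noise-like classifier values
print(weighted_sum_decision(frame_a, weights, 4.0))  # True
print(weighted_sum_decision(frame_b, weights, 4.0))  # False
```

The fixed threshold (4.0 here) is exactly the quantity that the adaptive method of this patent replaces with a frame-by-frame statistical estimate.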
A problem associated with the fixed weighted sum method is that it does not perform well when the speech environment changes. The reason is that the threshold is determined from a training set which differs from speech subjected to background noise, non-linear distortion, and filtering.
One method for adapting the threshold value to a changing speech environment is disclosed in the paper by H. Hassanein, et al., "Implementation of the Gold-Rabiner Pitch Detector in a Real Time Environment Using an Improved Voicing Detector," IEEE Transactions on Acoustics, Speech and Signal Processing, 1986, Tokyo, Vol. ASSP-33, No. 1, pp. 319-320. This paper discloses an empirical method which compares three different parameters against independent thresholds associated with these parameters and, on the basis of each comparison, either increments or decrements an adaptive threshold value by one.
The three parameters utilized are the energy of the signal, the first reflection coefficient, and the zero-crossing count. For example, if the energy of the speech signal is less than a predefined energy level, the adaptive threshold is incremented. On the other hand, if the energy of the speech signal is greater than another predefined energy level, the adaptive threshold is decremented by one. After the adaptive threshold has been calculated, it is subtracted from an output of an elementary pitch detector. If the result of the subtraction yields a positive number, the speech frame is declared voiced; otherwise, the speech frame is declared unvoiced. The problem with the disclosed method is that the parameters themselves are not used in the elementary pitch detector. Hence, the adjustment of the adaptive threshold is ad hoc and is not directly linked to the physical phenomena from which it is calculated. In addition, the threshold cannot adapt to rapidly changing speech environments.
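The increment/decrement scheme just described can be sketched as follows. Only the energy test is shown; the first reflection coefficient and the zero-crossing count would be compared against their own limits in the same way, and every numeric limit here is invented for illustration.

```python
# Hassanein-style empirical threshold adaptation: each parameter is
# compared against fixed limits and the adaptive threshold moves by one.

def update_adaptive_threshold(threshold, energy,
                              low_energy=100.0, high_energy=1000.0):
    """Nudge the adaptive threshold by one based on frame energy."""
    if energy < low_energy:        # quiet frame: bias toward "unvoiced"
        threshold += 1
    elif energy > high_energy:     # loud frame: bias toward "voiced"
        threshold -= 1
    return threshold

def classify(pitch_detector_output, threshold):
    """Voiced when the detector output minus the threshold is positive."""
    return (pitch_detector_output - threshold) > 0
```

Note the weakness the text identifies: `threshold` moves in fixed unit steps regardless of how far the environment has shifted, so it cannot track rapid changes.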
Solution

The above described problem is solved and a technical advance is achieved by a voicing decision apparatus that adapts to a changing environment by utilizing adaptive statistical values to make the voicing decision. The statistical values are adapted to the changing environment by utilizing statistics based on an output of a voiced detector. The statistical parameters are calculated by the voiced detector generating a general value indicating the presence of a fundamental frequency in a speech frame in response to speech attributes of the frame. Second, the mean for unvoiced ones and voiced ones of the speech frames is calculated in response to the generated value. The two means are then used to determine decision regions, and the determination of the presence of the fundamental frequency is done in response to the decision regions and the present speech frame.
Advantageously, in response to speech attributes of the present and past speech frames, the mean for unvoiced frames is calculated by calculating the probability that the present speech frame is unvoiced, calculating the overall probability that any frame will be unvoiced, and calculating the probability that the present speech frame is voiced. The mean of the unvoiced speech frames is then calculated in response to the probability that the present speech frame is unvoiced and the overall probability. In addition, the mean of the voiced speech frames is calculated in response to the probability that the present speech frame is voiced and the overall probability. Advantageously, the calculations of the probabilities are performed utilizing a maximum likelihood statistical operation.
According to a first aspect of the invention there is provided a method for detecting the presence of a fundamental frequency in frames of speech comprising: generating a set of classifiers for each frame in turn defining speech attributes of the corresponding frame; deriving from each said set of classifiers a single discriminant value whose magnitude is indicative of the presence or absence of said fundamental frequency; maintaining and updating in response to each successive discriminant value a set of parameters characterising the statistical distribution of the discriminant values; and comparing each current discriminant value with a value derived from the current set of parameters to determine the presence or absence of said fundamental frequency in the current frame.

According to a second aspect of the invention there is provided apparatus for detecting the presence of a fundamental frequency in frames of speech comprising: means for generating a set of classifiers for each frame in turn defining speech attributes of the corresponding frame; means for deriving from each said set of classifiers a single discriminant value whose magnitude is indicative of the presence or absence of said fundamental frequency; means for maintaining and updating in response to each successive discriminant value a set of parameters characterising the statistical distribution of the discriminant values; and means for comparing each current discriminant value with a value derived from the current set of parameters to determine the presence or absence of said fundamental frequency in the current frame.
Brief Description of the Drawing

FIG. 1 illustrates, in block diagram form, the present invention; and FIGS. 2 and 3 illustrate, in greater detail, certain functions performed by the voiced detection apparatus of FIG. 1.
Detailed Description

FIG. 1 illustrates an apparatus for performing the unvoiced/voiced decision operation by first utilizing a discriminant voiced detector to process voice classifiers in order to generate a discriminant variable or general variable. The latter variable is statistically analyzed to make the voicing decision. The statistical analysis adapts the threshold utilized in making the unvoiced/voiced decision so as to give reliable performance in a variety of voice environments.
Consider now the overall operation of the apparatus illustrated in FIG. 1. Classifier generator 100 is responsive to each frame of voice to generate classifiers which advantageously may be the log of the speech energy, the log of the LPC gain, the log area ratio of the first reflection coefficient, and the squared correlation coefficient of two speech segments one frame long which are offset by one pitch period. The calculation of these classifiers involves digitally sampling analog speech, forming frames of the digital samples, and processing those frames,
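Two of the classifiers named above, the log speech energy and the squared correlation coefficient of two pitch-offset segments, can be sketched as follows. This is an illustrative sketch under assumptions (the pitch period is taken as given and a small constant guards against division by zero); the LPC-based classifiers are omitted.

```python
import math

def log_energy(frame):
    """Log of the frame's energy (sum of squared samples)."""
    return math.log(sum(s * s for s in frame) + 1e-12)

def squared_correlation(samples, start, length, pitch):
    """Squared correlation coefficient between a segment and the
    segment one pitch period earlier; near 1 for periodic speech."""
    a = samples[start:start + length]
    b = samples[start - pitch:start - pitch + length]
    num = sum(x * y for x, y in zip(a, b)) ** 2
    den = sum(x * x for x in a) * sum(y * y for y in b) + 1e-12
    return num / den
```

For a pure tone whose period matches the assumed pitch lag, `squared_correlation` is very close to 1; for noise-like frames it falls toward 0, which is why it is a useful voicing classifier.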
and is well known in the art. Generator 100 transmits the classifiers to silence detector 101 and discriminant voiced detector 102 via path 106. Discriminant voiced detector 102 is responsive to the classifiers received via path 106 to calculate the discriminant value, x. Detector 102 performs that calculation by solving the equation: x = c'y + d. Advantageously, c is a vector comprising the weights, y is a vector comprising the classifiers, and d is a scalar representing a threshold value. Advantageously, the components of vector c are initialized as follows: the component corresponding to the log of the speech energy equals 0.3918606, the component corresponding to the log of the LPC gain equals -0.0520902, the component corresponding to the log area ratio of the first reflection coefficient equals 0.5637082, and the component corresponding to the squared correlation coefficient equals 1.361249; and d initially equals -8.36454. After calculating the value of the discriminant variable x, detector 102 transmits this value via path 111 to statistical calculator 103 and subtracter 107.
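The computation x = c'y + d with the initial weights quoted above can be illustrated directly. The weights and d come from the description; any classifier vector y passed in would come from generator 100, and example inputs are invented.

```python
# Initial weight vector c and scalar d as given in the text, ordered:
# log speech energy, log LPC gain, log area ratio of first reflection
# coefficient, squared correlation coefficient.
C = [0.3918606, -0.0520902, 0.5637082, 1.361249]
D = -8.36454

def discriminant(y):
    """x = c'y + d for one frame's classifier vector y."""
    return sum(c * yi for c, yi in zip(C, y)) + D

# With an all-zero classifier vector, x reduces to d = -8.36454.
x = discriminant([0.0, 0.0, 0.0, 0.0])
```

Because d is large and negative, x only becomes positive when the energy, gain, and correlation classifiers jointly indicate strong periodicity, which is the behaviour a voiced detector wants before any adaptation.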
Silence detector 101 is responsive to the classifiers transmitted via path 106 to determine whether speech is actually present on the data being received on path 109 by classifier generator 100. The indication of the presence of speech is transmitted via path 110 to statistical calculator 103 by silence detector 101.
For each frame of speech, detector 102 generates and transmits the discriminant value x via path 111. Statistical calculator 103 maintains an average of the discriminant values received via path 111 by averaging in the discriminant value for the present, non-silence frame with the discriminant values for previous non-silence frames. Statistical calculator 103 is also responsive to the signal received via path 110 to calculate the overall probability that any frame is unvoiced and the probability that any frame is voiced. In addition, statistical calculator 103 calculates the statistical value that the discriminant value for the present frame would have if the frame were unvoiced and the statistical value that the discriminant value for the present frame would have if the frame were voiced.
Advantageously, that statistical value may be the mean. The calculations performed by calculator 103 are based not only on the present frame but on previous frames as well. Statistical calculator 103 performs these calculations not only on the basis of the discriminant value received for the present frame via path 106 and the average of the classifiers but also on the basis of a weight and a threshold value, defining whether a frame is unvoiced or voiced, received via path 113 from threshold calculator 104.
Calculator 104 is responsive to the probabilities and statistical values of the classifiers for the present frame as generated by calculator 103 and received via path 112 to recalculate the values used as weight value a, and threshold value b for the present frame. Then, these new values of a and b are transmitted back to statistical calculator 103 via path 113.
Calculator 104 transmits the weight, threshold, and statistical values via path 114 to UV determinator 105. The latter detector is responsive to the information transmitted via paths 114 and 115 to determine whether or not the frame is unvoiced or voiced and to transmit this decision via path 116.
Consider now in greater detail the operations of blocks 103, 104, 105, and 107 illustrated in FIG. 1. Statistical calculator 103 implements an improved EM algorithm similar to that suggested in the article by N. E. Day entitled "Estimating the Components of a Mixture of Normal Distributions", Biometrika, Vol. 56, No. 3, pp. 463-474, 1969. Utilizing the concept of a decaying average, calculator 103 calculates the average of the discriminant values for the present and previous frames by calculating the following equations 1, 2, and 3:

n = n + 1 if n < 2000 (1)
z = 1/n (2)
Xn = (1 - z)Xn-1 + z xn (3)

xn is the discriminant value for the present frame and is received from detector 102 via path 111, and n is the number of frames that have been processed, up to 2000. z represents the decaying average coefficient, and Xn represents the average of the discriminant values for the present and past frames. Statistical calculator 103 is responsive to receipt of the z, xn, and Xn values to calculate the variance value, T, by first calculating the second moment of xn, Qn, as follows:

Qn = (1 - z)Qn-1 + z xn^2 (4)

After Qn has been calculated, T is calculated as follows:

T = Qn - Xn^2 (5)

The mean is subtracted from the discriminant value of the present frame as follows:

xn = xn - Xn (6)

Next, calculator 103 determines the probability that the frame represented by the present value xn is unvoiced by solving equation 7 shown below:

P(u | xn) = 1 / (1 + exp(a xn + b)) (7)

After solving equation 7, calculator 103 determines the probability that the discriminant value represents a voiced frame by solving the following:

P(v | xn) = 1 - P(u | xn) (8)

Next, calculator 103 determines the overall probability that any frame will be unvoiced by solving equation 9 for pn:

pn = (1 - z)pn-1 + z P(u | xn) (9)

After determining the probability that a frame will be unvoiced, calculator 103 determines two values, u and v, which give the mean values of the discriminant value for both unvoiced and voiced type frames. Value u, the statistical average unvoiced value, contains the mean discriminant value if a frame is unvoiced, and value v, the statistical average voiced value, gives the mean discriminant value if a frame is voiced. Value u for the present frame is solved by calculating equation 10, and value v is determined for the present frame by calculating equation 11, as follows:

un = (1 - z)un-1 + z xn P(u | xn) / pn (10)
vn = (1 - z)vn-1 + z xn P(v | xn) / (1 - pn) (11)

Calculator 103 now communicates the u, v, and T values, and probability pn, to threshold calculator 104 via path 112.
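A compact per-frame sketch of the statistical calculator's update, following the decaying-average forms of equations 1 through 11, might look as follows. The initial state values, and the decaying-average reconstruction of the OCR-damaged equations, are assumptions rather than the patent's exact code; the weight a and threshold b would be refreshed each frame by threshold calculator 104.

```python
import math

class StatCalc:
    """Running statistics over discriminant values (equations 1-11)."""

    def __init__(self):
        self.n = 0          # frame counter, capped at 2000
        self.X = 0.0        # decaying mean of discriminant values
        self.Q = 0.0        # decaying second moment
        self.p = 0.5        # overall probability a frame is unvoiced
        self.u = -1.0       # mean discriminant value, unvoiced frames
        self.v = 1.0        # mean discriminant value, voiced frames
        self.a, self.b = 1.0, 0.0   # set externally by calculator 104

    def update(self, x):
        if self.n < 2000:                                    # (1)
            self.n += 1
        z = 1.0 / self.n                                     # (2)
        self.X = (1 - z) * self.X + z * x                    # (3)
        self.Q = (1 - z) * self.Q + z * x * x                # (4)
        T = self.Q - self.X ** 2                             # (5) variance
        xc = x - self.X                                      # (6) mean-removed
        Pu = 1.0 / (1.0 + math.exp(self.a * xc + self.b))    # (7)
        Pv = 1.0 - Pu                                        # (8)
        self.p = (1 - z) * self.p + z * Pu                   # (9)
        self.u = (1 - z) * self.u + z * xc * Pu / self.p     # (10)
        self.v = (1 - z) * self.v + z * xc * Pv / (1 - self.p)  # (11)
        return xc, T
```

On the very first frame z equals 1, so the state is seeded entirely from that frame; thereafter z shrinks toward 1/2000, giving the slow decaying average the text describes.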
Calculator 104 is responsive to this information to calculate new values for a and b. These new values are then transmitted back to statistical calculator 103 via path 113. This allows rapid adaptation to changing environments. If n is greater than, advantageously, 99, values a and b are calculated as follows. Value a is determined by solving the following equation:

a = (vn - un) / (T - pn(1 - pn)(un - vn)^2) (12)

Value b is determined by solving the following equation:

b = -a(un + vn)/2 + log[(1 - pn)/pn] (13)

After calculating equations 12 and 13, calculator 104 transmits values a, u, and v to block 105 via path 114.

Determinator 105 is responsive to this transmitted information to decide whether the present frame is voiced or unvoiced. If the value a is positive, then a frame is declared voiced if the following equation is true:

a xn - a(un + vn)/2 >= 0 (14)

or, if the value a is negative, then a frame is declared voiced if the following equation is true:

a xn - a(un + vn)/2 <= 0 (15)

Equation 14 can also be expressed as:

a xn + b - log[(1 - pn)/pn] >= 0 (16)

Equation 15 can also be expressed as:

a xn + b - log[(1 - pn)/pn] <= 0 (17)
If the previous conditions are not met, determinator 105 declares the frame unvoiced.
In flow chart form, FIGS. 2 and 3 illustrate, in greater detail, the operations performed by the apparatus of FIG. 1. Block 200 implements block 101 of FIG. 1. Blocks 202 through 218 implement statistical calculator 103. Block 222 implements threshold calculator 104, and blocks 226 through 239 implement block 105 of FIG. 1. Subtracter 107 is implemented by both block 208 and block 224. Block 202 calculates the value which represents the average of the discriminant value for the present frame and all previous frames. Block 200 determines whether speech is present in the present frame; and if speech is not present in the present frame, the mean for the discriminant value is subtracted from the present discriminant value by block 224 before control is transferred to decision block 226.
However, if speech is present in the present frame, then the statistical and weight calculations are performed by blocks 202 through 222. First, the average value is found in block 202. Second, the second moment value is calculated in block 206. The latter value, along with the mean value X for the present and past frames, is then utilized to calculate the variance, T, also in block 206. The mean X is then subtracted from the discriminant value xn in block 208. Block 210 calculates the probability that the present frame is unvoiced by utilizing the present weight value a, the present threshold value b, and the discriminant value for the present frame, xn. After calculating the probability that the present frame is unvoiced, the probability that the present frame is voiced is calculated by block 212. Then, the overall probability, pn, that any frame will be unvoiced is calculated by block 214. Blocks 216 and 218 calculate two values: u and v. The value u represents the statistical average value that the discriminant value would have if the frame were unvoiced, whereas value v represents the statistical average value that the discriminant value would have if the frame were voiced. The actual discriminant values for the present and previous frames are clustered around either value u or value v. The discriminant values for the previous and present frames are clustered around value u if these frames had been found to be unvoiced; otherwise, the previous values are clustered around value v. Block 222 then calculates a new weight value a and a new threshold value b. The values a and b are used in the next sequential frame by the preceding blocks in FIG. 2.
Blocks 226 through 239 implement U/V determinator 105 of FIG. 1.
Block 226 determines whether the value a for the present frame is greater than zero. If this condition is true, then decision block 228 is executed. The latter decision block determines whether the test for voiced or unvoiced is met. If the frame is found to be voiced in decision block 228, then the frame is marked as voiced by block 230; otherwise, the frame is marked as unvoiced by block 232. If the value a is less than zero for the present frame, blocks 234 through 238 are executed and function in a similar manner to blocks 228 through 232.
It is to be understood that the afore-described embodiment is merely illustrative of the principles of the invention and that other arrangements may be devised by those skilled in the art without departing from the spirit and the scope of the invention.

Claims (10)

1. A method for detecting the presence of a fundamental frequency in frames of speech comprising: generating a set of classifiers for each frame in turn defining speech attributes of the corresponding frame; deriving from each said set of classifiers a single discriminant value whose magnitude is indicative of the presence or absence of said fundamental frequency; maintaining and updating in response to each successive discriminant value a set of parameters characterising the statistical distribution of the discriminant values; and comparing each current discriminant value with a value derived from the current set of parameters to determine the presence or absence of said fundamental frequency in the current frame.

2. A method as claimed in claim 1 including calculating a weight and a threshold value for each frame from the current said set of parameters wherein said maintaining and updating makes use of the weight and threshold value calculated for the previous frame to update the set of parameters for the current frame.

3. A method as claimed in claim 1 or claim 2 wherein said set of parameters includes an estimated mean discriminant value for frames including said fundamental frequency and an estimated mean discriminant value for frames not including said fundamental frequency.

4. A method as claimed in claim 3 wherein the said value derived from the current set of parameters is the arithmetic mean of the two said estimated mean discriminant values.

5. Apparatus for detecting the presence of a fundamental frequency in frames of speech comprising: means for generating a set of classifiers for each frame in turn defining speech attributes of the corresponding frame; means for deriving from each said set of classifiers a single discriminant value whose magnitude is indicative of the presence or absence of said fundamental frequency; means for maintaining and updating in response to each successive discriminant value a set of parameters characterising the statistical distribution of the discriminant values; and means for comparing each current discriminant value with a value derived from the current set of parameters to determine the presence or absence of said fundamental frequency in the current frame.
6. Apparatus as claimed in claim 5 including means for calculating a weight and a threshold value for each frame from the current said set of parameters wherein said means for maintaining and updating makes use of the weight and threshold value calculated for the previous frame to update the set of parameters for the current frame.
7. Apparatus as claimed in claim 5 or claim 6 wherein said set of parameters includes an estimated mean discriminant value for frames including said fundamental frequency and an estimated mean discriminant value for frames not including said fundamental frequency.

8. Apparatus as claimed in claim 7 wherein the said means for comparing is arranged to derive the said value from the current set of parameters as the arithmetic mean of the two said estimated mean discriminant values.
9. A method for detecting the presence of a fundamental frequency in frames of speech, substantially as hereinbefore described with reference to the drawings.
10. Apparatus for detecting the presence of a fundamental frequency in frames of speech, substantially as hereinbefore described with reference to the drawings.

DATED this SECOND day of APRIL 1990. American Telephone and Telegraph Company. Patent Attorneys for the Applicant: SPRUSON & FERGUSON. CMKW/KW
AU17007/88A 1987-04-03 1988-01-12 An adaptive threshold voiced detector Ceased AU598933B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US3429887A 1987-04-03 1987-04-03
US034298 1987-04-03

Publications (2)

Publication Number Publication Date
AU1700788A (en) 1988-11-02
AU598933B2 true AU598933B2 (en) 1990-07-05

Family

ID=21875533

Family Applications (1)

Application Number Title Priority Date Filing Date
AU17007/88A Ceased AU598933B2 (en) 1987-04-03 1988-01-12 An adaptive threshold voiced detector

Country Status (9)

Country Link
EP (1) EP0309561B1 (en)
JP (1) JPH0795239B2 (en)
AT (1) ATE83329T1 (en)
AU (1) AU598933B2 (en)
CA (1) CA1336208C (en)
DE (1) DE3876569T2 (en)
HK (1) HK21794A (en)
SG (1) SG60993G (en)
WO (1) WO1988007739A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU602957B2 (en) * 1987-04-03 1990-11-01 American Telephone And Telegraph Company Distance measurement control of a multiple detector system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU599459B2 (en) * 1987-04-03 1990-07-19 American Telephone And Telegraph Company An adaptive multivariate estimating apparatus
US5195138A (en) * 1990-01-18 1993-03-16 Matsushita Electric Industrial Co., Ltd. Voice signal processing device
US5204906A (en) * 1990-02-13 1993-04-20 Matsushita Electric Industrial Co., Ltd. Voice signal processing device
DE69130687T2 (en) * 1990-05-28 1999-09-09 Matsushita Electric Industrial Co. Speech signal processing device for cutting out a speech signal from a noisy speech signal

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU1242988A (en) * 1987-04-03 1988-11-02 American Telephone And Telegraph Company Distance measurement control of a multiple detector system
AU1222688A (en) * 1987-04-03 1988-11-02 American Telephone And Telegraph Company An adaptive multivariate estimating apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60114900A (en) * 1983-11-25 1985-06-21 松下電器産業株式会社 Voice/voiceless discrimination
JPS60200300A (en) * 1984-03-23 1985-10-09 松下電器産業株式会社 Voice head/end detector
JPS6148898A (en) * 1984-08-16 1986-03-10 松下電器産業株式会社 Voice/voiceless discriminator for voice

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU1242988A (en) * 1987-04-03 1988-11-02 American Telephone And Telegraph Company Distance measurement control of a multiple detector system
AU1222688A (en) * 1987-04-03 1988-11-02 American Telephone And Telegraph Company An adaptive multivariate estimating apparatus

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU602957B2 (en) * 1987-04-03 1990-11-01 American Telephone And Telegraph Company Distance measurement control of a multiple detector system

Also Published As

Publication number Publication date
SG60993G (en) 1993-07-09
DE3876569T2 (en) 1993-04-08
CA1336208C (en) 1995-07-04
DE3876569D1 (en) 1993-01-21
JPH01502858A (en) 1989-09-28
AU1700788A (en) 1988-11-02
JPH0795239B2 (en) 1995-10-11
EP0309561B1 (en) 1992-12-09
EP0309561A1 (en) 1989-04-05
WO1988007739A1 (en) 1988-10-06
HK21794A (en) 1994-03-18
ATE83329T1 (en) 1992-12-15

Similar Documents

Publication Publication Date Title
EP0335521B1 (en) Voice activity detection
JP3197155B2 (en) Method and apparatus for estimating and classifying a speech signal pitch period in a digital speech coder
US6993481B2 (en) Detection of speech activity using feature model adaptation
EP0694906A1 (en) Method and system for speech recognition
Haigh et al. A voice activity detector based on cepstral analysis.
JPH08505715A (en) Discrimination between stationary and nonstationary signals
US5007093A (en) Adaptive threshold voiced detector
AU598933B2 (en) An adaptive threshold voiced detector
EP0653091B1 (en) Discriminating between stationary and non-stationary signals
US6865529B2 (en) Method of estimating the pitch of a speech signal using an average distance between peaks, use of the method, and a device adapted therefor
US5046100A (en) Adaptive multivariate estimating apparatus
Kang et al. Discriminative weight training for a statistical model-based voice activity detection
US4972490A (en) Distance measurement control of a multiple detector system
AU599459B2 (en) An adaptive multivariate estimating apparatus
JP2002258881A (en) Device and program for detecting voice
EP0310636B1 (en) Distance measurement control of a multiple detector system
US20010029447A1 (en) Method of estimating the pitch of a speech signal using previous estimates, use of the method, and a device adapted therefor
Jebara A voice activity detector in noisy environments using linear prediction and coherence method
Moulsley et al. An adaptive voiced/unvoiced speech classifier.
Moakes et al. Recurrent radial basis functions for speech period detection
JPH0827637B2 (en) Voice / silence judgment circuit

Legal Events

Date Code Title Description
MK14 Patent ceased section 143(a) (annual fees not paid) or expired