EP0309561B1 - An adaptive threshold voiced detector - Google Patents

Info

Publication number
EP0309561B1
EP0309561B1 EP88903995A EP88903995A EP0309561B1 EP 0309561 B1 EP0309561 B1 EP 0309561B1 EP 88903995 A EP88903995 A EP 88903995A EP 88903995 A EP88903995 A EP 88903995A EP 0309561 B1 EP0309561 B1 EP 0309561B1
Authority
EP
European Patent Office
Prior art keywords
frames
speech
calculating
unvoiced
value
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
EP88903995A
Other languages
German (de)
French (fr)
Other versions
EP0309561A1 (en)
Inventor
David Lynn Thomson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
American Telephone and Telegraph Co Inc
AT&T Corp
Application filed by American Telephone and Telegraph Co Inc, AT&T Corp
Priority to AT88903995T (ATE83329T1)
Publication of EP0309561A1
Application granted
Publication of EP0309561B1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Definitions

  • Blocks 216 and 218 calculate two values: u and v.
  • The value u represents the statistical average value that the discriminant value would have if the frame were unvoiced.
  • The value v represents the statistical average value that the discriminant value would have if the frame were voiced.
  • The actual discriminant values for the present and previous frames are clustered around either value u or value v.
  • The discriminant values for the previous and present frames are clustered around value u if these frames had been found to be unvoiced; otherwise, the previous values are clustered around value v.
  • Block 222 then calculates a new weight value a and a new threshold value b. The values a and b are used in the next sequential frame by the preceding blocks in FIG. 2.
  • Blocks 226 through 239 implement U/V determinator 105 of FIG. 1.
  • Block 226 determines whether the value a for the present frame is greater than zero. If this condition is true, then decision block 228 is executed. The latter decision block determines whether the test for voiced or unvoiced is met. If the frame is found to be voiced in decision block 228, then the frame is marked as voiced by block 230; otherwise, the frame is marked as unvoiced by block 232. If the value a is less than zero for the present frame, blocks 234 through 238 are executed and function in a similar manner to blocks 228 through 232.

Abstract

Apparatus for detecting a fundamental frequency in speech by statistically analyzing a discriminant variable generated by a discriminant voiced detector (102) so as to determine the presence of the fundamental frequency in a changing speech environment. A statistical calculator (103) is responsive to the discriminant variable to first calculate the average of all of the values of the discriminant variable over the present and past speech frames and then to determine the overall probability that any frame will be unvoiced. In addition, the calculator forms two values: one value represents the statistical average of the discriminant values for unvoiced frames, and the other value represents the statistical average of the discriminant values for voiced frames. These latter calculations are performed utilizing not only the average discriminant value but also a weight value and a threshold value which are adaptively determined by a threshold calculator (104) from frame to frame. An unvoiced/voiced determinator (105) makes the unvoiced/voiced decision by utilizing the weight and threshold values.

Description

    Technical Field
  • This invention relates to determining whether or not speech contains a fundamental frequency which is commonly referred to as the unvoiced/voiced decision. More particularly, the unvoiced/voiced decision is made by a two stage voiced detector with the final threshold values being adaptively calculated for the speech environment utilizing statistical techniques.
    Background of the Invention
  • In low bit rate voice coders, degradation of voice quality is often due to inaccurate voicing decisions. The difficulty in correctly making these voicing decisions lies in the fact that no single speech parameter or classifier can reliably distinguish voiced speech from unvoiced speech. In order to make the voicing decision, it is known in the art to combine multiple speech classifiers in the form of a weighted sum. This method is commonly called discriminant analysis. Such a method is illustrated in D.P. Prezas, et al., "Fast and Accurate Pitch Detection Using Pattern Recognition and Adaptive Time-Domain Analysis", Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., Vol. 1, pp. 109-112, April 1986. As described in that article, a frame of speech is declared voiced if a weighted sum of classifiers is greater than a specified threshold, and unvoiced otherwise. The weights and threshold are chosen to maximize performance on a training set of speech where the voicing of each frame is known.
  • A problem associated with the fixed weighted sum method is that it does not perform well when the speech environment changes. The reason is that the threshold is determined from the training set which is different from speech subject to background noise, non-linear distortion, and filtering.
  • B.S. Atal and L.R. Rabiner, in "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition", IEEE Trans. Acoust., Speech, Signal Processing, vol ASSP-24, no.3, pages 201-212, June 1976, disclose a process for distinguishing between silence, voiced speech and unvoiced speech from a number of classifiers using a minimum-distance criterion in which the distance is defined by means of mean values and covariances of the classifiers for the three classes, determined from a manually classified set of training utterances.
  • P. de Souza, in "A Statistical Approach to the Design of an Adaptive Self-Normalizing Silence Detector", IEEE Trans. Acoust., Speech, Signal Processing, vol ASSP-31, no 3, pages 678-684, June 1983, discloses a process for detecting silence in which the first half-second of an input signal is assumed to be silent and means and covariances of the classifiers are calculated from this small sample. These are then used to detect, using a statistical test, a subsequent half-second of silence, which is then added to the original sample and the means and covariances updated, and so on until ten half-seconds of silence have been collected. The detector then returns to the start of the signal, using the means and covariances derived from the ten half-seconds of silence.
  • One method for adapting the threshold value to a changing speech environment is disclosed in the paper of H. Hassanein, et al., "Implementation of the Gold-Rabiner Pitch Detector in a Real Time Environment Using an Improved Voicing Detector", IEEE Transactions on Acoustic, Speech and Signal Processing, 1986, Tokyo, Vol. ASSP-33, No. 1, pp.319-320. This paper discloses an empirical method which compares three different parameters against independent thresholds associated with these parameters and, on the basis of each comparison, either increments or decrements an adaptive threshold value by one. The three parameters utilized are the energy of the signal, the first reflection coefficient, and the zero-crossing count. For example, if the energy of the speech signal is less than a predefined energy level, the adaptive threshold is incremented. On the other hand, if the energy of the speech signal is greater than another predefined energy level, the adaptive threshold is decremented by one. After the adaptive threshold has been calculated, it is subtracted from the output of an elementary pitch detector. If the result of the subtraction is a positive number, the speech frame is declared voiced; otherwise, the speech frame is declared unvoiced. The problem with the disclosed method is that the parameters themselves are not used in the elementary pitch detector. Hence, the adjustment of the adaptive threshold is ad hoc and is not directly linked to the physical phenomena from which it is calculated. In addition, the threshold cannot adapt to rapidly changing speech environments.
    Solution
  • The present invention accordingly provides a method and apparatus for making an adaptive voiced/unvoiced determination for frames of speech as claimed in claims 1, 5 or 8.
  • The above described problem is solved and a technical advance is achieved by a voicing decision apparatus that adapts to a changing environment by utilizing adaptive statistical values to make the voicing decision. The statistical values are adapted to the changing environment by utilizing statistics based on an output of a voiced detector. The statistical parameters are calculated as follows. First, the voiced detector generates a general value indicating the presence of a fundamental frequency in a speech frame in response to speech attributes of the frame. Second, the means for the unvoiced and the voiced speech frames are calculated in response to the generated value. The two means are then used to determine decision regions, and the determination of the presence of the fundamental frequency is done in response to the decision regions and the present speech frame.
  • Advantageously, in response to speech attributes of the present and past speech frames, the mean for unvoiced frames is calculated by calculating the probability that the present speech frame is unvoiced, calculating the overall probability that any frame will be unvoiced, and calculating the probability that the present speech frame is voiced. The mean of the unvoiced speech frames is then calculated in response to the probability that the present speech frame is unvoiced and the overall probability. In addition, the mean of the voiced speech frames is calculated in response to the probability that the present speech frame is voiced and the overall probability. Advantageously, the calculations of probabilities are performed utilizing a maximum likelihood statistical operation.
  • Advantageously, the generation of the general value is performed utilizing a discriminant analysis procedure, and the speech attributes are speech classifiers.
  • Advantageously, the decision regions are defined by the mean of the unvoiced and voiced speech frames and a weight and threshold value generated in response to the general values of past and present frames and the means of the voiced and unvoiced frames.
  • The method for detecting the presence of a fundamental frequency in speech frames comprises the steps of: generating a general value in response to a set of classifiers defining speech attributes of a present speech frame to indicate the presence of the fundamental frequency, calculating a set of statistical parameters in response to the general value, and determining the presence of the fundamental frequency in response to the general value and the calculated set of statistical parameters. The step of generating the general value is performed utilizing a discriminant analysis procedure. Further, the step of determining the fundamental frequency comprises the step of calculating a weight and a threshold value in response to the set of parameters.
    Brief Description of the Drawing
    • FIG. 1 illustrates, in block diagram form, the present invention; and
    • FIGS. 2 and 3 illustrate, in greater detail, certain functions performed by the voiced detection apparatus of FIG. 1.
    Detailed Description
  • FIG. 1 illustrates an apparatus for performing the unvoiced/voiced decision operation by first utilizing a discriminant voiced detector to process voice classifiers in order to generate a discriminant variable or general variable. The latter variable is statistically analyzed to make the voicing decision. The statistical analysis adapts the threshold utilized in making the unvoiced/voiced decision so as to give reliable performance in a variety of voice environments.
  • Consider now the overall operation of the apparatus illustrated in FIG. 1. Classifier generator 100 is responsive to each frame of voice to generate classifiers, which advantageously may be the log of the speech energy, the log of the LPC gain, the log area ratio of the first reflection coefficient, and the squared correlation coefficient of two speech segments one frame long which are offset by one pitch period. The calculation of these classifiers involves digitally sampling analog speech, forming frames of the digital samples, and processing those frames, and is well known in the art. Generator 100 transmits the classifiers to silence detector 101 and discriminant voiced detector 102 via path 106. Discriminant voiced detector 102 is responsive to the classifiers received via path 106 to calculate the discriminant value, x. Detector 102 performs that calculation by solving the equation:

    $$ x = c'y + d $$

    Advantageously, "c" is a vector comprising the weights, "y" is a vector comprising the classifiers, and "d" is a scalar representing a threshold value. Advantageously, the components of vector c are initialized as follows: the component corresponding to the log of the speech energy equals 0.3918606, the component corresponding to the log of the LPC gain equals -0.0520902, the component corresponding to the log area ratio of the first reflection coefficient equals 0.5637082, and the component corresponding to the squared correlation coefficient equals 1.361249; and d initially equals -8.36454. After calculating the value of the discriminant variable x, detector 102 transmits this value via path 111 to statistical calculator 103 and subtracter 107.
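The weighted sum above can be illustrated with a minimal Python sketch. The initial weights and the offset d are the values quoted in the text; the function name `discriminant_value`, the constant names, and the ordering of the classifiers inside the vector are assumptions made for illustration only.

```python
# Initial weights quoted in the text, in this assumed order:
# log energy, log LPC gain, log area ratio, squared correlation.
C_INIT = (0.3918606, -0.0520902, 0.5637082, 1.361249)
D_INIT = -8.36454  # initial scalar d quoted in the text


def discriminant_value(y, c=C_INIT, d=D_INIT):
    """Return x = c'y + d for one frame's classifier vector y."""
    if len(y) != len(c):
        raise ValueError("classifier vector and weight vector differ in length")
    # Dot product of weights and classifiers, plus the scalar offset.
    return sum(ci * yi for ci, yi in zip(c, y)) + d
```

With an all-zero classifier vector the function returns d itself, which is a quick sanity check on the wiring.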
  • Silence detector 101 is responsive to the classifiers transmitted via path 106 to determine whether speech is actually present on the data being received on path 109 by classifier generator 100. The indication of the presence of speech is transmitted via path 110 to statistical calculator 103 by silence detector 101.
  • For each frame of speech, detector 102 generates and transmits the discriminant value x via path 111. Statistical calculator 103 maintains an average of the discriminant values received via path 111 by averaging in the discriminant value for the present, non-silence frame with the discriminant values for previous non-silence frames. Statistical calculator 103 is also responsive to the signal received via path 110 to calculate the overall probability that any frame is unvoiced and the probability that any frame is voiced. In addition, statistical calculator 103 calculates the statistical value that the discriminant value for the present frame would have if the frame were unvoiced and the statistical value that the discriminant value for the present frame would have if the frame were voiced. Advantageously, that statistical value may be the mean. The calculations performed by calculator 103 are based not only on the present frame but also on previous frames. Statistical calculator 103 performs these calculations not only on the basis of the discriminant value received for the present frame via path 106 and the average of the classifiers but also on the basis of a weight and a threshold value, defining whether a frame is unvoiced or voiced, received via path 113 from threshold calculator 104.
  • Calculator 104 is responsive to the probabilities and statistical values of the classifiers for the present frame, as generated by calculator 103 and received via path 112, to recalculate the values used as weight value a and threshold value b for the present frame. These new values of a and b are then transmitted back to statistical calculator 103 via path 113.
  • Calculator 104 transmits the weight, threshold, and statistical values via path 114 to U/V determinator 105. The latter is responsive to the information transmitted via paths 114 and 115 to determine whether the frame is unvoiced or voiced and to transmit this decision via path 116.
  • Consider now in greater detail the operations of blocks 103, 104, 105, and 107 illustrated in FIG. 1. Statistical calculator 103 implements an improved EM algorithm similar to that suggested in the article by N. E. Day entitled "Estimating the Components of a Mixture of Normal Distributions", Biometrika, Vol. 56, No. 3, pp. 463-474, 1969. Utilizing the concept of a decaying average, calculator 103 calculates the average of the discriminant values for the present and previous frames by evaluating equations 1, 2, and 3:

    $$ n = n + 1 \quad \text{if } n < 2000 \tag{1} $$

    $$ z = 1/n \tag{2} $$

    $$ X_n = (1-z)\,X_{n-1} + z\,x_n \tag{3} $$


    Here, $x_n$ is the discriminant value for the present frame, received from detector 102 via path 111, and n is the number of frames that have been processed, up to a maximum of 2000. The value z is the decaying average coefficient, and $X_n$ is the average of the discriminant values for the present and past frames. Statistical calculator 103 is responsive to the z, $x_n$, and $X_n$ values to calculate the variance value, T, by first calculating the second moment of $x_n$, $Q_n$, as follows:
    Figure imgb0005

    After $Q_n$ has been calculated, T is calculated as follows:
    Figure imgb0006

    The mean is subtracted from the discriminant value of the present frame as follows:

    $$ x_n = x_n - X_n \tag{6} $$
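The decaying-average bookkeeping of equations 1 through 6 can be sketched in Python as follows. Equations 1, 2, 3, and 6 are taken from the text. Equations 4 and 5 are not reproduced in this text, so the sketch assumes the natural forms: a second moment updated with the same decaying average, and a variance equal to the second moment minus the squared mean. The class name `RunningStats` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    n: int = 0            # frames counted so far, capped at n_max
    mean: float = 0.0     # X_n, decaying average of x
    second: float = 0.0   # Q_n, decaying second moment (assumed form)
    n_max: int = 2000     # cap quoted in the text

    def update(self, x):
        """Fold in x_n; return (mean-subtracted x_n, variance T)."""
        if self.n < self.n_max:
            self.n += 1                                   # equation 1
        z = 1.0 / self.n                                  # equation 2
        self.mean = (1 - z) * self.mean + z * x           # equation 3
        # Assumption: equation 4 updates Q_n like equation 3 updates X_n.
        self.second = (1 - z) * self.second + z * x * x
        # Assumption: equation 5 is second moment minus squared mean.
        variance = self.second - self.mean ** 2
        return x - self.mean, variance                    # equation 6
```

While n is still below the cap, z = 1/n makes the update an ordinary running mean over all frames seen so far. Once n stops growing, z stays fixed and older frames are weighted down geometrically, which is what lets the statistics track a changing environment.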


    Next, calculator 103 determines the probability that the frame represented by the present value $x_n$ is unvoiced by solving equation 7 shown below:
    Figure imgb0008

    After solving equation 7, calculator 103 determines the probability that the discriminant value represents a voiced frame by solving the following:

    $$ P(v \mid x_n) = 1 - P(u \mid x_n) \tag{8} $$

    Next, calculator 103 determines the overall probability that any frame will be unvoiced by solving equation 9 for $p_n$:

    $$ p_n = (1-z)\,p_{n-1} + z\,P(u \mid x_n) \tag{9} $$


    After determining the overall probability that a frame will be unvoiced, calculator 103 determines two values, u and v, which give the mean discriminant values for unvoiced and voiced frames, respectively. Value u, the statistical average unvoiced value, is the mean discriminant value if a frame is unvoiced, and value v, the statistical average voiced value, is the mean discriminant value if a frame is voiced. Value u for the present frame is found by evaluating equation 10, and value v by evaluating equation 11, as follows:

    $$ u_n = (1-z)\,u_{n-1} + z\,x_n\,\frac{P(u \mid x_n)}{p_n} - z\,x_n \tag{10} $$

    $$ v_n = (1-z)\,v_{n-1} + z\,x_n\,\frac{P(v \mid x_n)}{1-p_n} - z\,x_n \tag{11} $$


    Calculator 103 now communicates the u, v, and T values, and probability $p_n$, to threshold calculator 104 via path 112.
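The probability and mean updates of equations 8 through 11 can be sketched as below. Equation 7 is not reproduced in this text. The sketch assumes a logistic form, P(u|x) = 1 / (1 + exp(a x + b)), which is what an equal-variance two-Gaussian mixture would give and which agrees with the statement that the calculation uses a, b, and the present discriminant value. The function names, the argument names, and the small clamp `eps` that keeps the divisions finite are all assumptions, not part of the source.

```python
import math


def prob_unvoiced(x, a, b):
    """Assumed form of equation 7: P(u|x) = 1 / (1 + exp(a*x + b))."""
    t = a * x + b
    # Evaluate in a way that never exponentiates a large positive number.
    if t >= 0:
        e = math.exp(-t)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))


def update_means(x, z, p_prev, u_prev, v_prev, a, b, eps=1e-6):
    """Return (p_n, u_n, v_n) for mean-subtracted x_n and decay coefficient z."""
    pu = prob_unvoiced(x, a, b)
    pv = 1.0 - pu                                          # equation 8
    p = (1 - z) * p_prev + z * pu                          # equation 9
    # Assumption: clamp p so the divisions below stay finite.
    p = min(max(p, eps), 1.0 - eps)
    u = (1 - z) * u_prev + z * x * pu / p - z * x          # equation 10
    v = (1 - z) * v_prev + z * x * pv / (1 - p) - z * x    # equation 11
    return p, u, v
```

A frame whose posterior is close to 1 for one class moves mostly that class's mean, so the two means drift toward the two clusters of discriminant values.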
  • Calculator 104 is responsive to this information to calculate new values for a and b. These new values are then transmitted back to statistical calculator 103 via path 113, which allows rapid adaptation to changing environments. Advantageously, if n is greater than 99, values a and b are calculated as follows. Value a is determined by solving the following equation:
    Figure imgb0013

    Value b is determined by solving the following equation:

    $$ b = -\tfrac{1}{2}\,a\,(u_n + v_n) + \log\!\left[\frac{1-p_n}{p_n}\right] \tag{13} $$


    After calculating equations 12 and 13, calculator 104 transmits values a, u, and v to block 105 via path 114.
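The threshold calculation can be sketched as follows. Equation 13 is taken from the text. Equation 12 is not reproduced in this text, so the expression for a below is an assumption: for a two-class mixture with equal within-class variance and zero overall mean (the mean was removed in equation 6), the within-class variance works out to T + u v, and the log-likelihood ratio is linear in x with slope (v - u) / (T + u v). The function name and the error handling are hypothetical.

```python
import math


def weight_and_threshold(u, v, variance, p):
    """Return (a, b); a uses an assumed form, b follows equation 13."""
    # Assumption: within-class variance of an equal-variance mixture
    # whose overall mean is zero.
    within = variance + u * v
    if within <= 0.0:
        raise ValueError("within-class variance estimate must be positive")
    a = (v - u) / within                                  # assumed equation 12
    b = -0.5 * a * (u + v) + math.log((1.0 - p) / p)      # equation 13
    return a, b
```

The text recalculates a and b only once n exceeds 99. A caller would presumably keep the previous values until then, although the text shown here does not state what happens for smaller n.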
  • Determinator 105 is responsive to this transmitted information to decide whether the present frame is voiced or unvoiced. If the value a is positive, a frame is declared voiced if the following inequality holds:

    $$ a\, x_n - a\,(u_n + v_n)/2 > 0 \tag{14} $$


    or, if the value a is negative, a frame is declared voiced if the following inequality holds:

    $$ a\, x_n - a\,(u_n + v_n)/2 < 0 \tag{15} $$


    Equation 14 can also be expressed as:

    $$ a\, x_n + b - \log\left[(1-p_n)/p_n\right] > 0 $$


    Equation 15 can also be expressed as:

    $$ a\, x_n + b - \log\left[(1-p_n)/p_n\right] < 0 $$


    If the previous conditions are not met, determinator 105 declares the frame unvoiced.
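
    The decision rule of equations 14 and 15 can be sketched as below. This is an illustrative reading of the text, not the patented apparatus itself: the function name is hypothetical, and the case a = 0 is treated as "conditions not met", which is an assumption.

```python
def is_voiced(a, x_n, u_n, v_n):
    """Apply the voiced/unvoiced test of equations 14 and 15.

    Returns True when the frame is declared voiced, False otherwise.
    """
    score = a * x_n - a * (u_n + v_n) / 2.0
    if a > 0:
        return score > 0   # equation 14
    if a < 0:
        return score < 0   # equation 15
    # a == 0: neither condition can be met, so the frame is unvoiced (assumption).
    return False
```

    For example, with a = 1, u_n = 0 and v_n = 1, a frame with x_n = 2 gives a score of 1.5 and is declared voiced, while a frame with x_n = 0 gives a score of -0.5 and is declared unvoiced.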
  • In flow chart form, FIGS. 2 and 3 illustrate, in greater detail, the operations performed by the apparatus of FIG. 1. Block 200 implements block 101 of FIG. 1. Blocks 202 through 218 implement statistical calculator 103. Block 222 implements threshold calculator 104, and blocks 226 through 239 implement block 105 of FIG. 1. Subtracter 107 is implemented by both block 208 and block 224. Block 202 calculates the value which represents the average of the discriminant value for the present frame and all previous frames. Block 200 determines whether speech is present in the present frame; and if speech is not present in the present frame, the mean for the discriminant value is subtracted from the present discriminant value by block 224 before control is transferred to decision block 226.
  • However, if speech is present in the present frame, then the statistical and weight calculations are performed by blocks 202 through 222. First, the average value is found in block 202. Second, the second moment value is calculated in block 206. The latter value along with the mean value X for the present and past frames is then utilized to calculate the variance, T, also in block 206. The mean X is then subtracted from the discriminant value xn in block 208.
  • Block 210 calculates the probability that the present frame is unvoiced by utilizing the present weight value a, the present threshold value b, and the discriminant value for the present frame, xn. After calculating the probability that the present frame is unvoiced, the probability that the present frame is voiced is calculated by block 212. Then, the overall probability, pn, that any frame will be unvoiced is calculated by block 214.
  • Blocks 216 and 218 calculate two values: u and v. The value u represents the statistical average value that the discriminant value would have if the frame were unvoiced, whereas value v represents the statistical average value that the discriminant value would have if the frame were voiced. The actual discriminant values for the present and previous frames are clustered around either value u or value v: around value u for frames that had been found to be unvoiced, and around value v otherwise. Block 222 then calculates a new weight value a and a new threshold value b. The values a and b are used for the next sequential frame by the preceding blocks in FIG. 2.
  • Blocks 226 through 239 implement U/V determinator 105 of FIG. 1. Block 226 determines whether the value a for the present frame is greater than zero. If this condition is true, then decision block 228 is executed. The latter decision block determines whether the test for voiced or unvoiced is met. If the frame is found to be voiced in decision block 228, then the frame is marked as voiced by block 230; otherwise, the frame is marked as unvoiced by block 232. If the value a is less than zero for the present frame, blocks 234 through 238 are executed and function in a similar manner to blocks 228 through 232.

Claims (11)

  1. An apparatus for making a voiced/unvoiced determination for frames of speech of unknown voicing, comprising:
       means (101) for detecting silence to select frames of speech of unknown voicing;
       means (102) responsive to a set of classifiers from a classifier generator (100) defining speech attributes of one of said frames of speech of unknown voicing for generating a general value initially indicating voiced or unvoiced;
       means (103) responsive to said general value for calculating a set of statistical parameters; and
       means (104) for calculating a threshold value in response to said set of said parameters;
       means (104) for calculating a weight value in response to said set of said parameters;
       means (105) responsive to said weight value and said threshold value and the calculated set of statistical parameters for determining voiced/unvoiced in said present one of said frames of speech of unknown voicing;
       Characterized by means (113) for communicating said weight value and said threshold value to said means (103) for calculating said set of parameters to be used for calculating another set of parameters for a subsequent one of said frames of speech of unknown voicing.
  2. The apparatus of claim 1 wherein said generating means (102) comprises means for performing a discriminant analysis to generate said general value.
  3. The apparatus of claim 2 wherein said means (104) for calculating said set of parameters is further responsive to the communicated weight and threshold values and another general value of said other one of said frames for calculating another set of statistical parameters.
  4. The apparatus of claim 3 wherein said means (104) for calculating said set of parameters further comprises means for calculating the average of said general values over said present and previous ones of said speech frames; and
       means responsive to said average of said general values for said present and previous ones of said speech frames and said communicated weight and threshold values and said other general value for determining said other set of statistical parameters.
  5. An apparatus for making a voiced/unvoiced determination for frames of speech, comprising:
       means (101) for detecting silence to select frames of speech;
       means (102) responsive to a set of classifiers from a classifier generator (100) defining speech attributes of each of a present and past ones said frames of speech for generating a general value initially indicating voiced or unvoiced;
       means for calculating (206) the variance of said general values over said present and previous ones of said speech frames;
       means responsive to present and past ones of said frames for calculating (210) the probability that said present one of said frames is unvoiced;
       means for calculating (212) the probability that said present one of said frames is voiced;
       Characterized by means responsive to said present and past ones of said frames and said probability that said present one of frames is unvoiced for calculating (214) the overall probability that any frame will be unvoiced;
       means responsive to said probability that said present one of said frames is unvoiced and said overall probability and said variance for calculating (216) a mean of said unvoiced ones of said frames;
       means responsive to said probability that said present one of said frames is voiced and said overall probability and said variance for calculating (218) a mean of said voiced ones of said frames;
       means responsive to said mean for unvoiced ones of said frames and said mean of voiced ones of said frames and said variance for determining (222) decision regions; and
       means (105) for making the voiced/unvoiced determination in response to said decision regions for said present one of said frames.
  6. The apparatus of claim 5 wherein said means for calculating said probability that said present one of said frames is unvoiced performs a maximum likelihood statistical operation.
  7. The apparatus of claim 6 wherein said means for calculating said probability that said present one of said frames is unvoiced further responsive to a weight threshold value to perform said maximum likelihood statistical operation.
  8. A method for making a voiced/unvoiced determination for frames of speech of unknown voicing comprising the steps of:
       detecting silence (200) in order to select frames of speech of unknown voicing;
       generating a general value in response to a set of classifiers from a classifier generator defining speech attributes of one of said frames of speech of unknown voicing to initially indicate voiced/unvoiced determination;
       calculating (103) a set of statistical parameters in response to said general value; and
       calculating (104) a threshold value in response to said set of said parameters;
       calculating (104) a weight value in response to said set of said parameters; and
       determining (105) voiced/unvoiced speech in said present one of said frames of speech of unknown voicing in response to said weight value and said threshold value and the calculated set of statistical parameters;
       Characterized by feeding back (113) said weight value and said threshold value for calculating another set of parameters for a subsequent one of said frames of speech.
  9. The method of claim 8 wherein said step of generating comprises the step of performing a discriminant analysis to generate said general value.
  10. The method of claim 9 wherein said step of calculating said set of parameters further responsive to the communicated weight and threshold values and another general value of said other one of said frames for calculating another set of statistical parameters.
  11. The method of claim 10 wherein said step of calculating said set of parameters further comprises the steps of calculating the average of said general values over said present and previous ones of said speech frames; and
       determining said other set of statistical parameters in response to said average of said general values for said present and previous ones of said speech frames and said communicated weight and threshold values and said other general values.
EP88903995A 1987-04-03 1988-01-12 An adaptive threshold voiced detector Expired - Lifetime EP0309561B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AT88903995T ATE83329T1 (en) 1987-04-03 1988-01-12 DETECTOR FOR VOICED SOUNDS WITH ADAPTIVE THRESHOLD.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US3429887A 1987-04-03 1987-04-03
US34298 1987-04-03

Publications (2)

Publication Number Publication Date
EP0309561A1 (en) 1989-04-05
EP0309561B1 (en) 1992-12-09

Family

ID=21875533

Family Applications (1)

Application Number Title Priority Date Filing Date
EP88903995A Expired - Lifetime EP0309561B1 (en) 1987-04-03 1988-01-12 An adaptive threshold voiced detector

Country Status (9)

Country Link
EP (1) EP0309561B1 (en)
JP (1) JPH0795239B2 (en)
AT (1) ATE83329T1 (en)
AU (1) AU598933B2 (en)
CA (1) CA1336208C (en)
DE (1) DE3876569T2 (en)
HK (1) HK21794A (en)
SG (1) SG60993G (en)
WO (1) WO1988007739A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01502779A (en) * 1987-04-03 1989-09-21 アメリカン テレフォン アンド テレグラフ カムパニー Adaptive multivariate estimator
US5195138A (en) * 1990-01-18 1993-03-16 Matsushita Electric Industrial Co., Ltd. Voice signal processing device
US5204906A (en) * 1990-02-13 1993-04-20 Matsushita Electric Industrial Co., Ltd. Voice signal processing device
EP0459384B1 (en) * 1990-05-28 1998-12-30 Matsushita Electric Industrial Co., Ltd. Speech signal processing apparatus for cutting out a speech signal from a noisy speech signal

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
JPS60114900A (en) * 1983-11-25 1985-06-21 松下電器産業株式会社 Voice/voiceless discrimination
JPS60200300A (en) * 1984-03-23 1985-10-09 松下電器産業株式会社 Voice head/end detector
JPS6148898A (en) * 1984-08-16 1986-03-10 松下電器産業株式会社 Voice/voiceless discriminator for voice
JPH01502779A (en) * 1987-04-03 1989-09-21 アメリカン テレフォン アンド テレグラフ カムパニー Adaptive multivariate estimator

Non-Patent Citations (1)

Title
Atal, Rabiner: "A pattern recognition approach to voiced/unvoiced/ silence classification ..." IEEE ASSP 24 No 3, 1976, Prezas et al as cited on p. 1 of description. Note: these documents were cited in the search report of a parallel application (88901684.6) by the same applicant *

Also Published As

Publication number Publication date
WO1988007739A1 (en) 1988-10-06
HK21794A (en) 1994-03-18
EP0309561A1 (en) 1989-04-05
ATE83329T1 (en) 1992-12-15
JPH0795239B2 (en) 1995-10-11
DE3876569D1 (en) 1993-01-21
JPH01502858A (en) 1989-09-28
AU598933B2 (en) 1990-07-05
SG60993G (en) 1993-07-09
AU1700788A (en) 1988-11-02
DE3876569T2 (en) 1993-04-08
CA1336208C (en) 1995-07-04

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase
    Free format text: ORIGINAL CODE: 0009012
AK Designated contracting states
    Kind code of ref document: A1; Designated state(s): AT BE DE FR GB IT NL
17P Request for examination filed
    Effective date: 19890328
17Q First examination report despatched
    Effective date: 19910408
GRAA (expected) grant
    Free format text: ORIGINAL CODE: 0009210
AK Designated contracting states
    Kind code of ref document: B1; Designated state(s): AT BE DE FR GB IT NL
REF Corresponds to:
    Ref document number: 83329; Country of ref document: AT; Date of ref document: 19921215; Kind code of ref document: T
ET Fr: translation filed
REF Corresponds to:
    Ref document number: 3876569; Country of ref document: DE; Date of ref document: 19930121
ITF It: translation for a ep patent filed
    Owner name: MODIANO & ASSOCIATI S.R.L.
PLBE No opposition filed within time limit
    Free format text: ORIGINAL CODE: 0009261
STAA Information on the status of an ep patent application or granted ep patent
    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT
26N No opposition filed
PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]
    Ref country code: FR; Payment date: 20011221; Year of fee payment: 15
REG Reference to a national code
    Ref country code: GB; Ref legal event code: IF02
PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]
    Ref country code: NL; Payment date: 20020107; Year of fee payment: 15
    Ref country code: GB; Payment date: 20020107; Year of fee payment: 15
PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]
    Ref country code: BE; Payment date: 20020114; Year of fee payment: 15
PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]
    Ref country code: AT; Payment date: 20020115; Year of fee payment: 15
PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]
    Ref country code: DE; Payment date: 20020328; Year of fee payment: 15
PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]
    Ref country code: GB; Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES; Effective date: 20030112
    Ref country code: AT; Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES; Effective date: 20030112
PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]
    Ref country code: BE; Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES; Effective date: 20030131
PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]
    Ref country code: NL; Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES; Effective date: 20030801
    Ref country code: DE; Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES; Effective date: 20030801
GBPC Gb: european patent ceased through non-payment of renewal fee
PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]
    Ref country code: FR; Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES; Effective date: 20030930
NLV4 Nl: lapsed or anulled due to non-payment of the annual fee
    Effective date: 20030801
REG Reference to a national code
    Ref country code: FR; Ref legal event code: ST
PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]
    Ref country code: IT; Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES;WARNING: LAPSES OF ITALIAN PATENTS WITH EFFECTIVE DATE BEFORE 2007 MAY HAVE OCCURRED AT ANY TIME BEFORE 2007. THE CORRECT EFFECTIVE DATE MAY BE DIFFERENT FROM THE ONE RECORDED.; Effective date: 20050112