AU599459B2 - An adaptive multivariate estimating apparatus - Google Patents
An adaptive multivariate estimating apparatus
- Publication number
- AU599459B2 AU12226/88A AU1222688A
- Authority
- AU
- Australia
- Prior art keywords
- classifiers
- frames
- speech
- statistical
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
Abstract
Apparatus for detecting a fundamental frequency in speech in a changing speech environment by using adaptive statistical techniques. A statistical voice detector (103) detects changes in the voice environment via classifiers that define certain attributes of the speech, and recalculates the weights used to combine the classifiers in making the unvoiced/voiced decision that specifies whether the speech has a fundamental frequency or not. The detector is responsive to the classifiers to first calculate the average of the classifiers (202) and then to determine the overall probability that any frame will be unvoiced. In addition, the detector, using a statistical calculator (203), forms two vectors: one represents the statistical average of the values that an unvoiced frame's classifiers would have, and the other represents the statistical average of the values of the classifiers for a voiced frame. These latter calculations are performed utilizing not only the average value of the classifiers and the present classifiers but also a vector defining the weights that are utilized to determine whether a frame is unvoiced or not, plus a threshold value. A weights calculator (204) is responsive to the information generated in the statistical calculations to generate a new set of values for the weights vector and the threshold value, which are utilized by the statistical calculator during the next frame. An unvoiced/voiced determinator (205) is then responsive to the two statistical average vectors and the weights vector to make the unvoiced/voiced decision.
Description
INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(51) International Patent Classification 4: G10L 3/00
(11) International Publication Number: WO 88/07738 A1
(43) International Publication Date: 6 October 1988 (06.10.88)
(21) International Application Number: PCT/US88/00030
(22) International Filing Date: 12 January 1988 (12.01.88)
(31) Priority Application Number: 034,296
(32) Priority Date: 3 April 1987 (03.04.87)
(33) Priority Country: US
(71) Applicant: AMERICAN TELEPHONE & TELEGRAPH COMPANY [US/US]; 550 Madison Avenue, New York, NY 10022
(72) Inventor: THOMSON, David Lynn; 29W543 Country Ridge Drive, Apartment D, Warrenville, IL 60555
(74) Agents: HIRSCH, Jr. et al.; Post Office Box 679, Holmdel, NJ 07733
(81) Designated States: AT (European patent), AU, BE (European patent), CH (European patent), DE (European patent), FR (European patent), GB (European patent), IT (European patent), JP, LU (European patent), NL (European patent), SE (European patent)

Published with international search report.

This document contains the amendments made under Section 49 and is correct for printing.
(54) Title: AN ADAPTIVE MULTIVARIATE ESTIMATING APPARATUS
AN ADAPTIVE MULTIVARIATE ESTIMATING APPARATUS

Technical Field

This invention relates to classifying samples representing a real time process into groups with each group corresponding to a state of the real time process. In particular, the classifying is done in real time as each sample is generated using statistical techniques.
Background and Problem

In many real time processes, a problem exists in attempting to estimate the present state of the process in a changing environment from present and past samples of the process. One example of such a process is the generation of speech by the human vocal tract. The sound produced by the vocal tract can have a fundamental frequency (voiced state) or no fundamental frequency (unvoiced state). Further, a third state may exist if no sound is being produced (silence state). The problem of determining these three states is referred to as the voicing/silence decision. In low bit rate voice coders, degradation of voice quality is often due to inaccurate voicing decisions. The difficulty in correctly making these voicing decisions lies in the fact that no single speech parameter or classifier can reliably distinguish voiced speech from unvoiced speech. In order to make the voicing decision, it is known in the art to combine multiple speech classifiers in the form of a weighted sum. Such a method is illustrated in D. P. Prezas, et al., "Fast and Accurate Pitch Detection Using Pattern Recognition and Adaptive Time-Domain Analysis," Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., Vol. 1, pp. 109-112, April 1986. As described in that article, a frame of speech is declared voiced if a weighted sum of speech classifiers is greater than a specified threshold, and unvoiced otherwise. Mathematically, this relationship may be expressed as a'x + b > 0, where a is a vector comprising the weights, x is a vector comprising the classifiers, and b is a scalar representing the threshold value. The weights are chosen to maximize performance on a training set of speech where the voicing of each frame is known. These weights form a decision rule which provides significant speech quality improvements in speech coders compared to those using a single parameter.
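As an illustration of this prior-art rule, here is a minimal sketch in Python/NumPy; the weights and threshold reuse the initialization constants quoted later in this description, while the frame's classifier values are invented:

```python
import numpy as np

def is_voiced(x: np.ndarray, a: np.ndarray, b: float) -> bool:
    """Fixed weighted-sum voicing rule: a frame is voiced iff a'x + b > 0."""
    return float(a @ x) + b > 0.0

# Four classifiers: log energy, log LPC gain, log area ratio of the first
# reflection coefficient, squared correlation coefficient.
a = np.array([0.3918606, -0.0520902, 0.5637082, 1.361249])  # trained weights
b = -8.36454                                                # trained threshold
x = np.array([21.0, 2.5, 0.8, 0.9])  # one frame's classifiers (invented)
print("voiced" if is_voiced(x, a, b) else "unvoiced")
```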
A problem associated with the fixed weighted sum method is that it does not perform well when the speech environment changes. Such changes in the speech environment may be a result of a telephone conversation being carried on in a car via a mobile telephone, or may be due to different telephone transmitters. The reason that the fixed weighted sum methods do not perform well in changing environments is that many speech classifiers are influenced by background noise, non-linear distortion, and filtering. If voicing is to be determined for speech with characteristics different from that of the training set, the weights, in general, will not yield satisfactory results.
One method for adapting the fixed weighted sum method to a changing speech environment is disclosed in the paper of J. P. Campbell, et al., "Voiced/Unvoiced Classification of Speech with Application to the U.S. Government LPC-10E Algorithm," IEEE International Conference on Acoustics, Speech and Signal Processing, 1986, Tokyo, Vol. 9.11.4, pp. 473-476. This paper discloses the utilization of different sets of weights and threshold values, each of which has been predetermined from the same set of training data with a different level of white noise added to the training data for each set of weights and threshold value. For each frame, one of these sets is chosen on the basis of the value of a signal-to-noise ratio, SNR, and the speech samples are then processed using that set's weights and threshold value. The range of possible values that the SNR can have is subdivided into subranges with each subrange being assigned to one of the sets. For each frame, the SNR is calculated; the subrange is determined; and then the detector associated with this subrange is used to determine whether the frame is unvoiced/voiced. The problem with this method is that it is only valid for the training data plus white noise and cannot adapt to a wide range of speech environments and speakers. Therefore, there exists a need for a voiced detector that can reliably determine whether speech is unvoiced or voiced for a varying environment and different speakers.
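A minimal sketch of that SNR-switched scheme follows; the subrange boundaries and the per-subrange weight sets are assumptions for illustration, not values from the Campbell paper:

```python
import numpy as np

# Hypothetical (a, b) sets, each pretrained on the same training data with a
# different level of added white noise; SNR boundaries in dB are invented.
SNR_DETECTORS = [
    (30.0, np.array([0.40, -0.05, 0.55, 1.40]), -8.4),    # clean speech
    (15.0, np.array([0.30, -0.05, 0.45, 1.10]), -6.0),    # moderate noise
    (-np.inf, np.array([0.20, 0.00, 0.35, 0.80]), -4.0),  # heavy noise
]

def classify_frame(x: np.ndarray, snr_db: float) -> bool:
    """Choose the detector whose SNR subrange contains snr_db, then apply
    the fixed weighted-sum rule a'x + b > 0 for that detector."""
    for lower_bound_db, a, b in SNR_DETECTORS:
        if snr_db >= lower_bound_db:
            return float(a @ x) + b > 0.0
```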
Solution According to a first aspect of the invention there is provided apparatus for determining the presence of a fundamental frequency in frames of non-training set speech, comprising: first means responsive to a set of classifiers defining speech attributes of a present one of said frames of non-training set speech and threshold values and sets of weights for previous frames for calculating a set of statistical distributions; second means responsive to the calculated set of statistical distributions for calculating a set of weights each associated with one of said classifiers for said present frame; and third means responsive to the calculated set of weights and classifiers and said set of statistical distributions for determining the presence of said fundamental frequency in said present frame of non-training set speech.
Said second means preferably comprises means for calculating a threshold value in response to said set of said statistical distributions; and means for communicating said set of said weights and said threshold value to said first means to be used for calculating another set of statistical distributions for another one of said frames of speech.
According to a second aspect of the invention there is provided a method for determining the presence of a fundamental frequency in frames of non-training set speech, comprising: calculating a set of statistical distributions in response to a set of classifiers defining speech attributes of a present one of said frames and threshold values and sets of weights for previous frames of non-training set speech; calculating a set of weights each associated with one of said classifiers in response to the calculated set of statistical distributions for said present frame; and determining the presence of said fundamental frequency in said present one of said frames of non-training set speech in response to the calculated set of weights and classifiers and said set of statistical distributions.
Brief Description of the Drawing

The invention may be better understood from the following detailed description which, when read with reference to the drawing in which: FIG. 1 is a block diagram of an apparatus using the present invention; FIG. 2 illustrates, in block diagram form, the present invention; FIGS. 3 and 4 illustrate, in greater detail, the functions performed by statistical voiced detector 103 of FIG. 2; and FIG. 5 illustrates, in greater detail, functions performed by block 340 of FIG. 4.
Detailed Description

FIG. 1 illustrates an apparatus for performing the unvoiced/voiced decision operation using, as one of the voiced detectors, a statistical voiced detector which is the subject of this invention. The apparatus of FIG. 1 utilizes two types of detectors: discriminant and statistical voiced detectors. Statistical voiced detector 103 is an adaptive detector that detects changes in the voice environment and modifies the weights used to process classifiers coming from classifier generator 101 so as to more accurately make the unvoiced/voiced decision. Discriminant voice detector 102 is utilized during initial start up or during rapidly changing voice environment conditions when statistical voiced detector 103 has not yet fully adapted to the initial or new voice environment.
Consider now the overall operation of the apparatus illustrated in FIG. 1. Classifier generator 101 is responsive to each frame of speech to generate classifiers which advantageously may be the log of the speech energy, the log of the LPC gain, the log area ratio of the first reflection coefficient, and the squared correlation coefficient of two speech segments one frame long which are offset by one pitch period. The calculation of these classifiers involves digitally sampling analog speech, forming frames of the digital samples, and processing those frames, and is well known in the art. Generator 101 transmits the classifiers to detectors 102 and 103 via path 106.
Detectors 102 and 103 are responsive to the classifiers received via path 106 to make unvoiced/voiced decisions and transmit these decisions via paths 107 and 110, respectively, to multiplexer 105. In addition, the detectors determine a distance measure between voiced and unvoiced frames and transmit these distances via paths 108 and 109 to comparator 104. Advantageously, these distances may be Mahalanobis distances or other generalized distances.
Comparator 104 is responsive to the distances received via paths 108 and 109 to control multiplexer 105 so that the latter multiplexer selects the output of the detector that is generating the largest distance.
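Expressed as code, that selection logic is a one-line comparison; the sketch below uses invented function and parameter names:

```python
def select_decision(stat_decision: bool, stat_distance: float,
                    disc_decision: bool, disc_distance: float) -> bool:
    """Comparator 104 / multiplexer 105: forward the unvoiced/voiced decision
    of whichever detector currently shows the larger distance (i.e. the
    better separation between its voiced and unvoiced populations)."""
    if stat_distance >= disc_distance:
        return stat_decision  # statistical voiced detector 103 wins
    return disc_decision      # discriminant voiced detector 102 wins
```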
FIG. 2 illustrates, in greater detail, statistical voiced detector 103. For each frame of speech, a set of classifiers also referred to as a vector of classifiers is received via path 106 from classifier generator 101. Silence detector 201 is responsive to these classifiers to determine whether or not speech is present in the present frame. If speech is present, detector 201 transmits a signal via path 210.
If no speech (silence) is present in the frame, then only subtractor 207 and U/V determinator 205 are operational for that particular frame. Whether speech is present or not, the unvoiced/voiced decision is made for every frame by determinator 205.
In response to the signal from detector 201, classifier averager 202 maintains an average of the individual classifiers received via path 106 by averaging in the classifiers for the present frame with the classifiers for previous frames. If speech (non-silence) is present in the frame, silence detector 201 signals statistical calculator 203, generator 206, and averager 202 via path 210.
Statistical calculator 203 calculates statistical distributions for voiced and unvoiced frames. In particular, calculator 203 is responsive to the signal received via path 210 to calculate the overall probability that any frame is unvoiced and the probability that any frame is voiced. In addition, statistical calculator 203 calculates the statistical value that each classifier would have if the frame was unvoiced and the statistical value that each classifier would have if the frame was voiced. Further, calculator 203 calculates the covariance matrix of the classifiers. Advantageously, that statistical value may be the mean. The calculations performed by calculator 203 are based not only on the present frame but on previous frames as well. Statistical calculator 203 performs these calculations not only on the basis of the classifiers received for the present frame via path 106 and the average of the classifiers received via path 211, but also on the basis of the weight for each classifier and a threshold value defining whether a frame is unvoiced or voiced, received via path 213 from weights calculator 204.
Weights calculator 204 is responsive to the probabilities, covariance matrix, and statistical values of the classifiers for the present frame, as generated by calculator 203 and received via path 212, to recalculate the values used as the weight vector, a, for each of the classifiers and the threshold value, b, for the present frame. Then, these new values of a and b are transmitted back to statistical calculator 203 via path 213.
Also, weights calculator 204 transmits the weights and the statistical values for the classifiers in both the unvoiced and voiced regions via path 214, determinator 205, and path 208 to generator 206. The latter generator is responsive to this information to calculate the distance measure which is subsequently transmitted via path 109 to comparator 104 as illustrated in FIG. 1.
U/V determinator 205 is responsive to the information transmitted via paths 214 and 215 to determine whether or not the frame is unvoiced or voiced and to transmit this decision via path 110 to multiplexer 105 of FIG. 1.
Consider now in greater detail the operation of each block illustrated in FIG. 2 which is now given in terms of vector and matrix mathematics.
Averager 202, statistical calculator 203, and weights calculator 204 implement an improved EM algorithm similar to that suggested in the article by N. E. Day entitled "Estimating the Components of a Mixture of Normal Distributions", Biometrika, Vol. 56, no. 3, pp. 463-474, 1969. Utilizing the concept of a decaying average, classifier averager 202 calculates the average for the classifiers for the present and previous frames by calculating the following equations 1, 2, and 3:

n = n + 1 if n < 2000 (1)

z = 1/n (2)

Xn = (1 - z) Xn-1 + z xn (3)

xn is a vector representing the classifiers for the present frame, and n is the number of frames that have been processed, up to 2000. z represents the decaying average coefficient, and Xn represents the average of the classifiers over the present and past frames. Statistical calculator 203 is responsive to receipt of the z, xn and Xn information to calculate the covariance matrix, T, by first calculating the matrix of sums of squares and products, Qn, as follows:

Qn = (1 - z) Qn-1 + z xn xn' (4)

After Qn has been calculated, T is calculated as follows:

T = Qn - Xn Xn' (5)

The means are subtracted from the classifiers as follows:

xn = xn - Xn (6)

Next, calculator 203 determines the probability that the frame represented by the present vector xn is unvoiced by solving equation 7 shown below where, advantageously, the components of vector a are initialized as follows: the component corresponding to the log of the speech energy equals 0.3918606, the component corresponding to the log of the LPC gain equals -0.0520902, the component corresponding to the log area ratio of the first reflection coefficient equals 0.5637082, and the component corresponding to the squared correlation coefficient equals 1.361249; and b initially equals -8.36454:

P(u | xn) = 1 / [1 + exp(a'xn + b)] (7)

After solving equation 7, calculator 203 determines the probability that the classifiers represent a voiced frame by solving the following:

P(v | xn) = 1 - P(u | xn) (8)

Next, calculator 203 determines the overall probability that any frame will be unvoiced by solving equation 9 for pn:

pn = (1 - z) pn-1 + z P(u | xn) (9)

After determining the probability that a frame will be unvoiced, calculator 203 then determines two vectors, u and v, which give the mean values of each classifier for both unvoiced and voiced type frames. Vectors u and v are the statistical averages for unvoiced and voiced frames, respectively. Vector u, the statistical average unvoiced vector, contains the mean values of each classifier if a frame is unvoiced; and vector v, the statistical average voiced vector, gives the mean value for each classifier if a frame is voiced. Vector u for the present frame is solved by calculating equation 10, and vector v is determined for the present frame by calculating equation 11 as follows:

un = un-1 [1 - z P(u | xn)/pn] + z xn P(u | xn)/pn (10)

vn = vn-1 [1 - z P(v | xn)/(1 - pn)] + z xn P(v | xn)/(1 - pn) (11)

Calculator 203 now communicates the u and v vectors, the T matrix, and probability pn to weights calculator 204 via path 212.
Weights calculator 204 is responsive to this information to calculate new values for vector a and scalar b. These new values are then transmitted back to statistical calculator 203 via path 213. This allows detector 103 to adapt rapidly to changing environments. Advantageously, even if the new values for vector a and scalar b are not transmitted back to statistical calculator 203, detector 103 will continue to adapt to changing environments since vectors u and v are being updated. As will be seen, determinator 205 uses vectors u and v as well as vector a and scalar b to make the voicing decision. If n is greater than, advantageously, 99, vector a and scalar b are calculated as follows. Vector a is determined by solving the following equation:

a = T^-1 (vn - un) / [1 - pn(1 - pn)(un - vn)' T^-1 (un - vn)] (12)

Scalar b is determined by solving the following equation:

b = -(1/2) a'(un + vn) + log[(1 - pn)/pn] (13)

After calculating equations 12 and 13, weights calculator 204 transmits vectors a, u, and v to block 205 via path 214. If the frame contained silence, only equation 6 is calculated.
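In code, equations 12 and 13 reduce to one linear solve; a sketch follows, using `np.linalg.solve` in place of an explicit inverse and omitting numerical guards:

```python
import numpy as np

def update_weights(T: np.ndarray, u: np.ndarray, v: np.ndarray, p: float):
    """Recompute weight vector a (eq. 12) and threshold b (eq. 13)."""
    d = v - u
    Tinv_d = np.linalg.solve(T, d)                   # T^-1 (vn - un)
    denom = 1.0 - p * (1.0 - p) * float(d @ Tinv_d)  # (un-vn)' T^-1 (un-vn) == d' T^-1 d
    a = Tinv_d / denom                               # eq. 12
    b = -0.5 * float(a @ (u + v)) + np.log((1.0 - p) / p)  # eq. 13
    return a, b
```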
Determinator 205 is responsive to this transmitted information to decide whether the present frame is voiced or unvoiced. If the element of vector (vn - un) corresponding to power is positive, then a frame is declared voiced if the following equation is true:

a'xn - a'(un + vn)/2 > 0 (14)

or, if the element of vector (vn - un) corresponding to power is negative, then a frame is declared voiced if the following equation is true:

a'xn - a'(un + vn)/2 < 0 (15)

Equation 14 can also be rewritten as:

a'xn + b - log[(1 - pn)/pn] > 0

Equation 15 can also be rewritten as:

a'xn + b - log[(1 - pn)/pn] < 0

If the previous conditions are not met, determinator 205 declares the frame unvoiced. Equations 14 and 15 represent decision regions for making the voicing decision. The log term of the rewritten forms of equations 14 and 15 can be eliminated with some change of performance. Advantageously, in the present example, the element corresponding to power is the log of the speech energy.
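The decision regions of equations 14 and 15 can be sketched as follows; the index of the power (log energy) element is an assumption of this illustration:

```python
import numpy as np

POWER_INDEX = 0  # position of the log speech energy classifier (assumed)

def is_voiced_frame(xc: np.ndarray, a: np.ndarray,
                    u: np.ndarray, v: np.ndarray) -> bool:
    """Apply equation 14 or 15 to a mean-subtracted classifier vector xc."""
    score = float(a @ xc) - 0.5 * float(a @ (u + v))
    if (v - u)[POWER_INDEX] >= 0.0:
        return score > 0.0  # eq. 14: power element of (vn - un) positive
    return score < 0.0      # eq. 15: power element of (vn - un) negative
```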
Generator 206 is responsive to the information received via path 214 from calculator 204 to calculate the distance measure, A, as follows. First, the discriminant variable, d, is calculated by equation 16 as follows:

d = a'xn + b - log[(1 - pn)/pn] (16)

Advantageously, it would be obvious to one skilled in the art to use different types of voicing detectors to generate a value similar to d for use in the following equations. One such detector would be an auto-correlation detector. If the frame is voiced, equations 17 through 20 are solved as follows:

m1 = (1 - z) m1 + z d (17)

s1 = (1 - z) s1 + z d² (18)

k1 = s1 - m1² (19)

where m1 is the mean for voiced frames and k1 is the variance for voiced frames. The probability, Pd, that determinator 205 will declare a frame unvoiced is calculated by the following equation:

Pd = (1 - z) Pd (20)

Advantageously, Pd is initially set to 0.5. If the frame is unvoiced, equations 21 through 24 are solved as follows:

m0 = (1 - z) m0 + z d (21)

s0 = (1 - z) s0 + z d² (22)

k0 = s0 - m0² (23)

The probability, Pd, that determinator 205 will declare a frame unvoiced is calculated by the following equation:

Pd = (1 - z) Pd + z (24)

After calculating equations 16 through 24, the distance measure or merit value is calculated as follows:

A² = Pd (1 - Pd)(m1 - m0)² / [(1 - Pd) k1 + Pd k0] (25)

Equation 25 uses Hotelling's two-sample T² statistic to calculate the distance measure. For equation 25, the larger the merit value, the greater the separation. However, other merit values exist where the smaller the merit value the greater the separation. Advantageously, the distance measure can also be the Mahalanobis distance, which is given in the following equation:

A² = (m1 - m0)² / [(1 - Pd) k1 + Pd k0] (26)

Advantageously, a third technique is given in the following equation:

A² = (m1 - m0)² / (k1 + k0) (27)

Advantageously, a fourth technique for calculating the distance measure is illustrated in the following equation:

A² = a'(vn - un) (28)

Discriminant detector 102 makes the unvoiced/voiced decision by transmitting information to multiplexer 105 via path 107 indicating a voiced frame if a'x + b > 0. If this condition is not true, then detector 102 indicates an unvoiced frame. The values for vector a and scalar b used by detector 102 are advantageously identical to the initial values of a and b for statistical voiced detector 103.
Detector 102 determines the distance measure in a manner similar to generator 206 by performing calculations similar to those given in equations 16 through 28.
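Generator 206's running computation, equations 16 through 25, can be sketched as a small state object; the initial values and the absence of divide-by-zero guards are assumptions of this illustration:

```python
import numpy as np

class DistanceGenerator:
    """Running distance measure of equations 16-25 (a sketch)."""

    def __init__(self):
        self.m1 = self.s1 = 0.0  # mean / mean-square of d over voiced frames
        self.m0 = self.s0 = 0.0  # mean / mean-square of d over unvoiced frames
        self.Pd = 0.5            # P(frame declared unvoiced); start value assumed

    def update(self, xc, a, b, p, z, voiced: bool) -> float:
        d = float(a @ xc) + b - np.log((1.0 - p) / p)  # eq. 16
        if voiced:
            self.m1 = (1 - z) * self.m1 + z * d        # eq. 17
            self.s1 = (1 - z) * self.s1 + z * d * d    # eq. 18
            self.Pd = (1 - z) * self.Pd                # eq. 20
        else:
            self.m0 = (1 - z) * self.m0 + z * d        # eq. 21
            self.s0 = (1 - z) * self.s0 + z * d * d    # eq. 22
            self.Pd = (1 - z) * self.Pd + z            # eq. 24
        k1 = self.s1 - self.m1 ** 2                    # eq. 19
        k0 = self.s0 - self.m0 ** 2                    # eq. 23
        # eq. 25: Hotelling-style merit value; larger means better separation
        return (self.Pd * (1 - self.Pd) * (self.m1 - self.m0) ** 2
                / ((1 - self.Pd) * k1 + self.Pd * k0))
```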
In flow chart form, FIGS. 3 and 4 illustrate, in greater detail, the operations performed by statistical voiced detector 103 of FIG. 2. Blocks 302 and 300 implement blocks 202 and 201 of FIG. 2, respectively. Blocks 304 through 318 implement statistical calculator 203. Blocks 320 and 322 implement weights calculator 204, and blocks 326 through 338 implement block 205 of FIG. 2. Generator 206 of FIG. 2 is implemented by block 340. Subtractor 207 is implemented by block 308 or block 324.
Block 302 calculates the vector which represents the average of the classifiers for the present frame and all previous frames. Block 300 determines whether speech or silence is present in the present frame; if silence is present in the present frame, the mean for each classifier is subtracted from each classifier by block 324 before control is transferred to decision block 326. However, if speech is present in the present frame, then the statistical and weights calculations are performed by blocks 304 through 322. First, the average vector is found in block 302. Second, the sums of the squares and products matrix is calculated in block 304. The latter matrix along with the vector Xn representing the mean of the classifiers for the present and past frames is then utilized to calculate the covariance matrix, T, in block 306. The mean Xn is then subtracted from the classifier vector xn in block 308.
Block 310 then calculates the probability that the present frame is unvoiced by utilizing the present weight vector a, the present threshold value b, and the classifier vector for the present frame, xn. After calculating the probability that the present frame is unvoiced, the probability that the present frame is voiced is calculated by block 312. Then, the overall probability, pn, that any frame will be unvoiced is calculated by block 314.
Blocks 316 and 318 calculate two vectors: u and v. The values contained in vector u represent the statistical average values that each classifier would have if the frame were unvoiced, whereas vector v contains values representing the statistical average values that each classifier would have if the frame were voiced. The actual vectors of classifiers for the present and previous frames are clustered around either vector u or vector v. The vectors representing the classifiers for the previous and present frames are clustered around vector u if these frames are found to be unvoiced; otherwise, the previous classifier vectors are clustered around vector v.
After execution of blocks 316 and 318, control is transferred to decision block 320. If N is greater than 99, control is transferred to block 322; otherwise, control is transferred to block 326. Upon receiving control, block 322 then calculates a new weight vector a and a new threshold value b. The vector a and value b are used in the next sequential frame by the preceding blocks in FIG. 3. Advantageously, if N is required to be greater than infinity, vector a and scalar b will never be changed, and detector 103 will adapt solely in response to vectors v and u as illustrated in blocks 326 through 338.
Blocks 326 through 338 implement u/v determinator 205 of FIG. 2.
Block 326 determines whether the power term of vector v of the present frame is greater than or equal to the power term of vector u. If this condition is true, then decision block 328 is executed. The latter decision block determines whether the test for voiced or unvoiced is met. If the frame is found to be voiced in decision block 328, then the frame is marked as voiced by block 330; otherwise, the frame is marked as unvoiced by block 332. If the power term of vector v is less than the power term of vector u for the present frame, blocks 334 through 338 are executed and function in a similar manner. Finally, block 340 calculates the distance measure.
In flow chart form, FIG. 5 illustrates, in greater detail, the operations performed by block 340 of FIG. 4. Decision block 501 determines whether the frame has been indicated as unvoiced or voiced by examining the calculations performed by blocks 330, 332, 336, or 338. If the frame has been designated as voiced, path 507 is selected. Block 510 calculates probability Pd, block 502 recalculates the mean, m1, for the voiced frames, and block 503 recalculates the variance, k1, for voiced frames. If the frame was determined to be unvoiced, decision block 501 selects path 508. Block 509 recalculates probability Pd, block 504 recalculates the mean, m0, for unvoiced frames, and block 505 recalculates the variance, k0, for unvoiced frames. Finally, block 506 calculates the distance measure by performing the calculations indicated.
It is to be understood that the afore-described embodiment is merely illustrative of the principles of the invention and that other arrangements may be devised by those skilled in the art without departing from the spirit and the scope of the invention. In particular, the calculations performed per frame or set could be performed for a group of frames or sets.
Claims (20)
The claims defining the invention are as follows:

1. An apparatus for determining the presence of a fundamental frequency in frames of non-training set speech, comprising: first means responsive to a set of classifiers defining speech attributes of a present one of said frames of non-training set speech and threshold values and sets of weights for previous frames for calculating a set of statistical distributions; second means responsive to the calculated set of statistical distributions for calculating a set of weights each associated with one of said classifiers for said present frame; and third means responsive to the calculated set of weights and classifiers and said set of statistical distributions for determining the presence of said fundamental frequency in said present frame of non-training set speech.

2. The apparatus of claim 1 wherein said second means comprises means for calculating a threshold value in response to said set of said statistical distributions; and means for communicating said set of said weights and said threshold value to said first means to be used for calculating another set of statistical distributions for another one of said frames of speech.

3. The apparatus of claim 2 wherein said first means is further responsive to the communicated set of weights and another set of classifiers defining said speech attributes of said other one of said frames for calculating another set of statistical distributions.

4. The apparatus of claim 3 wherein said first means comprises means for calculating the average of each of said classifiers over previous ones of said non-training set speech frames; and means responsive to said average ones of said classifiers for said previous ones of said non-training set speech frames and said communicated set of weights and said other set of classifiers for determining said other set of statistical distributions.

5. The apparatus of claim 4 wherein said first means further comprises means for detecting the presence of speech in each of said frames; and means for inhibiting the calculation of said other set of statistical distributions for said other one of said frames upon speech not being detected in said other one of said frames.

6. The apparatus of claim 5 wherein said first means further comprises means for calculating the probability that said other set of classifiers represents an unvoiced frame and the probability that said other set of classifiers represents a voiced frame; and means for calculating the overall probability that any frame is unvoiced.

7. The apparatus of claim 6 wherein said first means further comprises means for calculating a set of statistical average classifiers representing an unvoiced frame by determining the average of the sets of classifiers and a set of statistical average classifiers representing a voiced frame by determining the average of the sets of classifiers.

8. The apparatus of claim 7 wherein said first means further comprises means for calculating a covariance matrix from said set of statistical average classifiers representing an unvoiced frame for said other one of said frames and said set of classifiers representing an unvoiced frame for said other one of said frames.

9. The apparatus of claim 8 wherein said second means is responsive to the covariance matrix and said sets of statistical average classifiers for both voiced and unvoiced frames and said overall probability for a frame being unvoiced for determining said other set of statistical distributions.

10. The apparatus of claim 9 wherein said third means is responsive to said other set of statistical distributions and said sets of statistical average classifiers for unvoiced and voiced frames for determining the presence of said fundamental frequency in said other one of said frames.

11. A method for determining the presence of a fundamental frequency in frames of non-training set speech, comprising: calculating a set of statistical distributions in response to a set of classifiers defining speech attributes of a present one of said frames and threshold values and sets of weights for previous frames of non-training set speech; calculating a set of weights each associated with one of said classifiers in response to the calculated set of statistical distributions for said present frame; and determining the presence of said fundamental frequency in said present one of said frames of non-training set speech in response to the calculated set of weights and classifiers and said set of statistical distributions.

12. The method of claim 11 wherein said step of calculating said set of weights comprises the steps of calculating a threshold value in response to said set of statistical distributions; and communicating said set of said weights and said threshold value for use in calculating another set of statistical distributions for another one of said frames of non-training set speech.

13. The method of claim 12 wherein said step of calculating said set of statistical distributions is further responsive to the communicated set of weights and another set of classifiers defining said speech attributes of said other one of said frames to calculate another set of statistical distributions.

14. The method of claim 13 wherein said step of calculating said set of statistical distributions further comprises the steps of calculating the average of each of said classifiers over previous ones of said non-training set speech frames; and calculating said other set of statistical distributions in response to said average ones of said classifiers for said previous ones of said non-training set speech frames and said communicated set of weights and said other set of classifiers.

15. The method of claim 14 wherein said step of calculating said set of statistical distributions further comprises the steps of detecting the presence of speech in each of said frames; and inhibiting the calculation of said other set of statistical distributions for said other one of said frames upon speech not being detected in said other one of said frames.

16. The method of claim 15 wherein said step of calculating said set of statistical distributions further comprises the steps of calculating the probability that said other set of classifiers represents an unvoiced frame and the probability that said other set of classifiers represents a voiced frame; and calculating the overall probability that any frame is unvoiced.

17. The method of claim 14 wherein said step of calculating said set of statistical distributions further comprises the step of calculating a set of statistical average classifiers representing an unvoiced frame by determining the average of the sets of classifiers and a set of statistical average classifiers representing a voiced frame by determining the average of the sets of classifiers.

18. The method of claim 17 wherein said step of calculating said set of statistical distributions further comprises the step of calculating a covariance matrix from said set of statistical average classifiers representing an unvoiced frame for said other one of said frames and said set of classifiers representing an unvoiced frame for said other one of said frames.

19. An apparatus for determining the presence of a fundamental frequency in frames of non-training set speech substantially as hereinbefore described with reference to the drawings.

20. A method for determining the presence of a fundamental frequency in frames of non-training set speech substantially as hereinbefore described with reference to the drawings.

DATED this TWENTY-THIRD day of APRIL 1990
American Telephone and Telegraph Company
Patent Attorneys for the Applicant
SPRUSON & FERGUSON
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US3429687A | 1987-04-03 | 1987-04-03 | |
US034296 | 1987-04-03 |
Publications (2)
Publication Number | Publication Date |
---|---|
AU1222688A AU1222688A (en) | 1988-11-02 |
AU599459B2 true AU599459B2 (en) | 1990-07-19 |
Family
ID=21875521
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU12226/88A Ceased AU599459B2 (en) | 1987-04-03 | 1988-01-12 | An adaptive multivariate estimating apparatus |
Country Status (9)
Country | Link |
---|---|
EP (1) | EP0308433B1 (en) |
JP (1) | JPH01502779A (en) |
AT (1) | ATE82426T1 (en) |
AU (1) | AU599459B2 (en) |
CA (2) | CA1337708C (en) |
DE (1) | DE3875894T2 (en) |
HK (1) | HK106693A (en) |
SG (1) | SG59893G (en) |
WO (1) | WO1988007738A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU602957B2 (en) * | 1987-04-03 | 1990-11-01 | American Telephone And Telegraph Company | Distance measurement control of a multiple detector system |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE3876569T2 (en) * | 1987-04-03 | 1993-04-08 | American Telephone & Telegraph | DETECTOR FOR TUNING LOUD WITH ADAPTIVE THRESHOLD. |
JP3277398B2 (en) * | 1992-04-15 | 2002-04-22 | ソニー株式会社 | Voiced sound discrimination method |
US6202046B1 (en) | 1997-01-23 | 2001-03-13 | Kabushiki Kaisha Toshiba | Background noise/speech classification method |
JP3670217B2 (en) | 2000-09-06 | 2005-07-13 | 国立大学法人名古屋大学 | Noise encoding device, noise decoding device, noise encoding method, and noise decoding method |
JP4517045B2 (en) * | 2005-04-01 | 2010-08-04 | 独立行政法人産業技術総合研究所 | Pitch estimation method and apparatus, and pitch estimation program |
CN104517614A (en) * | 2013-09-30 | 2015-04-15 | 上海爱聊信息科技有限公司 | Voiced/unvoiced decision device and method based on sub-band characteristic parameter values |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU1700788A (en) * | 1987-04-03 | 1988-11-02 | American Telephone And Telegraph Company | An adaptive threshold voiced detector |
AU1242988A (en) * | 1987-04-03 | 1988-11-02 | American Telephone And Telegraph Company | Distance measurement control of a multiple detector system |
-
1988
- 1988-01-12 JP JP62506332A patent/JPH01502779A/en not_active Withdrawn
- 1988-01-12 WO PCT/US1988/000030 patent/WO1988007738A1/en active IP Right Grant
- 1988-01-12 AT AT88901347T patent/ATE82426T1/en not_active IP Right Cessation
- 1988-01-12 AU AU12226/88A patent/AU599459B2/en not_active Ceased
- 1988-01-12 DE DE8888901347T patent/DE3875894T2/en not_active Expired - Lifetime
- 1988-01-12 EP EP88901347A patent/EP0308433B1/en not_active Expired - Lifetime
- 1988-02-29 CA CA000560109A patent/CA1337708C/en not_active Expired - Fee Related
-
1993
- 1993-05-07 SG SG598/93A patent/SG59893G/en unknown
- 1993-10-07 HK HK1066/93A patent/HK106693A/en not_active IP Right Cessation
-
1995
- 1995-03-09 CA CA000616983A patent/CA1338251C/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU1700788A (en) * | 1987-04-03 | 1988-11-02 | American Telephone And Telegraph Company | An adaptive threshold voiced detector |
AU1242988A (en) * | 1987-04-03 | 1988-11-02 | American Telephone And Telegraph Company | Distance measurement control of a multiple detector system |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU602957B2 (en) * | 1987-04-03 | 1990-11-01 | American Telephone And Telegraph Company | Distance measurement control of a multiple detector system |
Also Published As
Publication number | Publication date |
---|---|
WO1988007738A1 (en) | 1988-10-06 |
JPH0795237B1 (en) | 1995-10-11 |
AU1222688A (en) | 1988-11-02 |
CA1337708C (en) | 1995-12-05 |
ATE82426T1 (en) | 1992-11-15 |
EP0308433A1 (en) | 1989-03-29 |
EP0308433B1 (en) | 1992-11-11 |
DE3875894T2 (en) | 1993-05-19 |
DE3875894D1 (en) | 1992-12-17 |
SG59893G (en) | 1993-07-09 |
CA1338251C (en) | 1996-04-16 |
JPH01502779A (en) | 1989-09-21 |
HK106693A (en) | 1993-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2165229C (en) | Method and apparatus for characterizing an input signal | |
EP0625774B1 (en) | A method and an apparatus for speech detection | |
US20020165713A1 (en) | Detection of sound activity | |
JPH11510334A (en) | Assess signal quality | |
CN109448726A (en) | A kind of method of adjustment and system of voice control accuracy rate | |
US4937870A (en) | Speech recognition arrangement | |
US5046100A (en) | Adaptive multivariate estimating apparatus | |
JP5385876B2 (en) | Speech segment detection method, speech recognition method, speech segment detection device, speech recognition device, program thereof, and recording medium | |
AU599459B2 (en) | An adaptive multivariate estimating apparatus | |
US5007093A (en) | Adaptive threshold voiced detector | |
US4972490A (en) | Distance measurement control of a multiple detector system | |
FI111572B (en) | Procedure for processing speech in the presence of acoustic interference | |
EP0310636B1 (en) | Distance measurement control of a multiple detector system | |
EP0309561B1 (en) | An adaptive threshold voiced detector | |
CN112786068B (en) | Audio sound source separation method, device and storage medium | |
AU612737B2 (en) | A phoneme recognition system | |
Qian et al. | Sloclas: A database for joint sound localization and classification | |
Bertocco et al. | In-service nonintrusive measurement of noise and active speech level in telephone-type networks | |
Grimaldi | An improved procedure for QoS measurement in telecommunication systems | |
Metzger | Blind segmentation of a multi-speaker conversation using two different sets of features | |
Jebara | A voice activity detector in noisy environments using linear prediction and coherence method | |
Yamazaki et al. | An objective method for evaluating the quality of speech with code errors using pattern matching techniques | |
Verma et al. | AN AUTOMATIC SPEECH RECOGNITION APPROACH USING MODIFIED VOICE ACTIVITY DETECTION MECHANISM | |
Karpov et al. | Combining Voice Activity Detection Algorithms by Decision Fusion | |
Wei et al. | A pitch analysis technique for automated speech distortion identification in VoIP networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MK14 | Patent ceased section 143(a) (annual fees not paid) or expired |