CA1338251C - Adaptive multivariate estimating apparatus - Google Patents

Adaptive multivariate estimating apparatus

Info

Publication number
CA1338251C
Authority
CA
Canada
Prior art keywords
frames
unvoiced
present
probability
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CA000616983A
Other languages
French (fr)
Inventor
David Lynn Thomson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
American Telephone and Telegraph Co Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by American Telephone and Telegraph Co Inc filed Critical American Telephone and Telegraph Co Inc
Priority to CA000616983A priority Critical patent/CA1338251C/en
Application granted granted Critical
Publication of CA1338251C publication Critical patent/CA1338251C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
  • Feedback Control In General (AREA)
  • Paper (AREA)
  • Bridges Or Land Bridges (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Measuring Pulse, Heart Rate, Blood Pressure Or Blood Flow (AREA)
  • Measurement Of Radiation (AREA)
  • Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The present invention relates to an apparatus for determining the voicing decision for non-training set speech signals. The apparatus is comprised of a unit which is responsive to the non-training set speech signals for sampling the speech signals to produce digital speech signals, to form frames of the digital non-training set speech signals, and to process each frame to generate a set of classifiers defining speech attributes.
A unit is also provided for estimating statistical distributions for voiced and unvoiced frames without prior knowledge of the voicing decisions for past ones of the frames of digital non-training set speech. A unit is provided which is responsive to these statistical distributions for determining decision regions representing voiced and unvoiced digital non-training set speech. A unit is then provided which is responsive to the decision regions and a present one of the frames for making the voicing decisions. Finally, a unit is provided which is responsive to the determination of the voicing decision in the frame of the digital non-training set speech signals for transmitting a signal to a data unit for subsequent use in speech processing.

Description

AN ADAPTIVE MULTIVARIATE ESTIMATING APPARATUS
This is a division of co-pending Canadian Patent Application Serial No. 560,109 filed February 29, 1988.
Technical Field
This invention relates to classifying samples representing a real time process into groups, with each group corresponding to a state of the real time process. In particular, the classifying is done in real time as each sample is generated, using statistical techniques.
Background and Problem
In many real time processes, a problem exists in attempting to estimate the present state of the process in a changing environment from present and past samples of the process. One example of such a process is the generation of speech by the human vocal tract. The sound produced by the vocal tract can have a fundamental frequency - voiced state - or no fundamental frequency - unvoiced state. Further, a third state may exist if no sound is being produced - silence state. The problem of determining these three states is referred to as the voicing/silence decision. In low bit rate voice coders, degradation of voice quality is often due to inaccurate voicing decisions. The difficulty in correctly making these voicing decisions lies in the fact that no single speech parameter or classifier can reliably distinguish voiced speech from unvoiced speech. In order to make the voicing decision, it is known in the art to combine multiple speech classifiers in the form of a weighted sum. Such a method is illustrated in D. P. Prezas, et al., "Fast and Accurate Pitch Detection Using Pattern Recognition and Adaptive Time-Domain Analysis,"
Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., Vol. 1, pp. 109-112, April, 1986.
As described in that article, a frame of speech is declared voiced if a weighted sum of speech classifiers is greater than a specified threshold; and unvoiced otherwise.
Mathematically, this relationship may be expressed as a'x + b > 0 where "a" is a vector comprising the weights, "x" is a vector comprising the classifiers, and "b" is a scalar representing the threshold value. The weights are chosen to maximize performance on a training set of speech where the voicing of each frame is known. These weights form a decision rule which provides significant speech quality improvements in speech coders compared to those using a single parameter.
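As a concrete illustration (not part of the patent text), the sketch below implements this fixed weighted-sum rule in Python/NumPy. The weights and threshold are the initial values quoted later in this description; the sample classifier vector is invented:

```python
import numpy as np

# Weight vector "a" and threshold "b": the initial values given later in
# this description, one weight per classifier (log energy, log LPC gain,
# log area ratio of first reflection coefficient, squared correlation).
a = np.array([0.3918606, -0.0520902, 0.5637082, 1.361249])
b = -8.36454

def fixed_weighted_sum_is_voiced(x: np.ndarray) -> bool:
    """Declare a frame voiced when a'x + b > 0, unvoiced otherwise."""
    return float(a @ x + b) > 0.0

# Hypothetical classifier vector for one frame.
x = np.array([11.0, 2.5, 0.8, 0.9])
print("voiced" if fixed_weighted_sum_is_voiced(x) else "unvoiced")
```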
A problem associated with the fixed weighted sum method is that it does not perform well when the speech environment changes. Such changes in the speech environment may be a result of a telephone conversation being carried on in a car via a mobile telephone, or may be due to different telephone transmitters. The reason that the fixed weighted sum methods do not perform well in changing environments is that many speech classifiers are influenced by background noise, non-linear distortion, and filtering. If voicing is to be determined for speech with characteristics different from that of the training set, the weights, in general, will not yield satisfactory results.
One method for adapting the fixed weighted sum method to a changing speech environment is disclosed in the paper of J. P. Campbell, et al., "Voiced/Unvoiced Classification of Speech with Application to the U.S. Government LPC-10E Algorithm," IEEE International Conference on Acoustics, Speech and Signal Processing, 1986, Tokyo, Vol. 9.11.4, pp. 473-476. This paper discloses the utilization of different sets of weights and threshold values, each of which has been predetermined from the same set of training data with different levels of white noise being added to the training data for each set of weights and threshold value. For each frame, the speech samples are processed by a set of weights and a threshold value after one of these sets is chosen on the basis of the value of a signal-to-noise ratio, SNR. The range of possible values that the SNR can have is subdivided into subranges, with each subrange being assigned to one of the sets. For each frame, the SNR is calculated; the subrange is determined; and then, the detector associated with this subrange is used to determine whether the frame is unvoiced or voiced. The problem with this method is that it is only valid for the training data plus white noise and cannot adapt to a wide range of speech environments and speakers. Therefore, there exists a need for a voiced detector that can reliably determine whether speech is unvoiced or voiced for a varying environment and different speakers.
Solution
The above described problem is solved and a technical advance is achieved by an apparatus that is responsive to real time samples from a physical process to determine statistical distributions for a plurality of process states and from those distributions to establish decision regions. The latter regions are used to determine the present process state as each process sample is generated. For use in making a voicing decision, the apparatus adapts to a changing speech environment by utilizing the statistics of classifiers of the speech. Statistics are based on the classifiers and are used to modify the decision regions used in the voicing decision. Advantageously, the apparatus estimates statistical distributions for both voiced and unvoiced frames and uses those statistical distributions for determining decision regions. The latter regions are then used to determine whether a present speech frame is voiced or unvoiced.
Advantageously, a voiced detector calculates the probability that the present speech frame is unvoiced, the probability that the present speech frame is voiced, and an overall probability that any frame will be unvoiced. Using these three probabilities, the detector then calculates the probability distribution of unvoiced frames and the probability distribution of voiced frames. In addition, the calculation for determining the probability that the present speech frame is voiced or unvoiced is performed by doing a maximum likelihood statistical operation.
Also, the maximum likelihood statistical operation is responsive to a weight vector and a threshold value in addition to the probabilities. In another embodiment, the weight vector and threshold value are adaptively calculated for each frame. This adaptive calculation of the weight vector and the threshold value allows the detector to rapidly adapt to changing speech environments.
Advantageously, an apparatus for determining the presence of the fundamental frequency in frames of speech has a circuit responsive to a set of classifiers representing the speech attributes of a speech frame for calculating a set of statistical parameters. A second circuit is responsive to the calculated set of parameters defining the statistical distributions to calculate a set of weights, each associated with one of the classifiers. Finally, a third circuit, in response to the calculated set of weights and classifiers and the set of parameters, determines the presence of the fundamental frequency in the speech frame or, as it is commonly expressed, makes the unvoiced/voiced decision.
Advantageously, the second circuit also calculates a threshold value and a new weight vector and communicates these values to the first circuit, which is responsive to these values and a new set of classifiers for determining another set of statistical parameters. This other set of statistical parameters is then used to determine the presence of the fundamental frequency for the next frame of speech. Advantageously, the first circuit is responsive to the next set of classifiers and the new weight vector and threshold value to calculate the probability that the next frame is unvoiced, the probability that the next frame is voiced, and the overall probability that any frame will be unvoiced. These probabilities are then utilized with a set of values giving the average of classifiers for past and present frames to determine the other set of statistical parameters.

The method for determining a voicing decision is performed by the following steps: estimating statistical distributions for voiced and unvoiced frames, determining decision regions representing voiced and unvoiced speech in response to the statistical distributions, and making the voicing decision in response to the decision regions and a present speech frame. In addition, the statistical distributions are calculated from the probability that the present speech frame is unvoiced, the probability that the present speech frame is voiced, and the overall probability that any frame will be unvoiced. These three probabilities are calculated as three sub-steps of the step of determining the statistical distributions.
Brief Description of the Drawing
The present invention, taken in conjunction with the invention disclosed in co-pending Canadian Patent Application Serial No. 560,109, will be described in detail hereinbelow with the aid of the accompanying drawings, in which:
FIG. 1 is a block diagram of an apparatus using the present invention;
FIG. 2 illustrates, in block diagram form, the present invention;
FIGS. 3 and 4 illustrate, in greater detail, the functions performed by statistical voiced detector 103 of FIG. 2; and
FIG. 5 illustrates, in greater detail, functions performed by block 340 of FIG. 4.
Detailed Description
FIG. 1 illustrates an apparatus for performing the unvoiced/voiced decision operation using, as one of the voiced detectors, a statistical voiced detector which is the subject of this invention. The apparatus of FIG. 1 utilizes two types of detectors: discriminant and statistical voiced detectors. Statistical voiced detector 103 is an adaptive detector that detects changes in the voice environment and modifies the weights used to process classifiers coming from classifier generator 101 so as to more accurately make the unvoiced/voiced decision.
Discriminant voice detector 102 is utilized during initial start up or rapidly changing voice environment conditions when statistical voice detector 103 has not yet fully adapted to the initial or new voice environment.
Consider now the overall operation of the apparatus illustrated in FIG. 1.
Classifier generator 101 is responsive to each frame of speech to generate classifiers which advantageously may be the log of the speech energy, the log of the LPC gain, the log area ratio of the first reflection coefficient, and the squared correlation coefficient of two speech segments one frame long which are offset by one pitch period. The calculation of these classifiers involves digitally sampling analog speech, forming frames of the digital samples, and processing those frames, and is well known in the art. Generator 101 transmits the classifiers to detectors 102 and 103 via path 106.
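A rough sketch of how such classifiers could be computed per frame follows. The frame length, LPC order, epsilon guards, and externally supplied pitch period are assumptions for illustration; the patent does not fix them:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def make_classifiers(frame, history, pitch_period, order=10):
    """Compute the four example classifiers for one frame of sampled speech.
    `history` must hold at least `pitch_period` samples preceding `frame`,
    and `frame` must be longer than `order`."""
    n = len(frame)
    r = np.correlate(frame, frame, "full")[n - 1:]      # autocorrelation r[0..n-1]
    log_energy = np.log(r[0] + 1e-12)                   # log of the speech energy
    # LPC coefficients from the normal equations; gain = residual energy.
    lpc = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    log_lpc_gain = np.log(max(r[0] - lpc @ r[1:order + 1], 1e-12))
    # Log area ratio of the first reflection coefficient k1 = r(1)/r(0).
    k1 = r[1] / (r[0] + 1e-12)
    lar = np.log((1.0 + k1 + 1e-12) / (1.0 - k1 + 1e-12))
    # Squared correlation of two one-frame segments offset by one pitch period.
    prev = np.concatenate((history, frame))[-(n + pitch_period):][:n]
    denom = np.sqrt((prev @ prev) * (frame @ frame)) + 1e-12
    rho2 = (prev @ frame / denom) ** 2
    return np.array([log_energy, log_lpc_gain, lar, rho2])
```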
Detectors 102 and 103 are responsive to the classifiers received via path 106 to make unvoiced/voiced decisions and transmit these decisions via paths 107 and 110, respectively, to multiplexer 105. In addition, the detectors determine a distance measure between voiced and unvoiced frames and transmit these distances via paths 108 and 109 to comparator 104. Advantageously, these distances may be Mahalanobis distances or other generalized distances.
Comparator 104 is responsive to the distances received via paths 108 and 109 to control multiplexer 105 so that the latter multiplexer selects the output of the detector that is generating the largest distance.
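In code, the comparator/multiplexer pair reduces to picking the decision whose detector reports the larger distance. A toy sketch; the tie-break toward the statistical detector is an assumption:

```python
def select_output(stat_decision, stat_dist, disc_decision, disc_dist):
    """Comparator 104 steers multiplexer 105 toward the detector
    currently reporting the larger voiced/unvoiced distance measure."""
    return stat_decision if stat_dist >= disc_dist else disc_decision
```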
FIG. 2 illustrates, in greater detail, statistical voiced detector 103. For each frame of speech, a set of classifiers, also referred to as a vector of classifiers, is received via path 106 from classifier generator 101. Silence detector 201 is responsive to these classifiers to determine whether or not speech is present in the present frame. If speech is present, detector 201 transmits a signal via path 210.
If no speech (silence) is present in the frame, then only subtractor 207 and U/V determinator 205 are operational for that particular frame. Whether speech is present or not, the unvoiced/voiced decision is made for every frame by U/V determinator 205.
In response to the signal from detector 201, classifier averager 202 maintains an average of the individual classifiers received via path 106 by averaging in the classifiers for the present frame with the classifiers for previous frames. If speech (non-silence) is present in the frame, silence detector 201 signals statistical calculator 203, generator 206, and averager 202 via path 210.
Statistical calculator 203 calculates statistical distributions for voiced and unvoiced frames. In particular, calculator 203 is responsive to the signal received via path 210 to calculate the overall probability that any frame is unvoiced and the probability that any frame is voiced. In addition, statistical calculator 203 calculates the statistical value that each classifier would have if the frame was unvoiced and the statistical value that each classifier would have if the frame was voiced. Further, calculator 203 calculates the covariance matrix of the classifiers. Advantageously, that statistical value may be the mean. The calculations performed by calculator 203 are based not only on the present frame but on previous frames as well. Statistical calculator 203 performs these calculations not only on the basis of the classifiers received for the present frame via path 106 and the average of the classifiers received via path 211, but also on the basis of the weight for each classifier and a threshold value, defining whether a frame is unvoiced or voiced, received via path 213 from weights calculator 204. Weights calculator 204 is responsive to the probabilities, covariance matrix, and statistical values of the classifiers for the present frame, as generated by calculator 203 and received via path 212, to recalculate the values used as weight vector a, for each of the classifiers, and the threshold value b, for the present frame. Then, these new values of a and b are transmitted back to statistical calculator 203 via path 213.
Also, weights calculator 204 transmits the weights and the statistical values for the classifiers in both the unvoiced and voiced regions via path 214, determinator 205, and path 208 to generator 206. The latter generator is responsive to this information to calculate the distance measure which is subsequently transmitted via path 109 to comparator 104 as illustrated in FIG. 1.
U/V determinator 205 is responsive to the information transmitted via paths 214 and 215 to determine whether the frame is unvoiced or voiced and to transmit this decision via path 110 to multiplexer 105 of FIG. 1.
Consider now, in greater detail, the operation of each block illustrated in FIG. 2, which is now given in terms of vector and matrix mathematics.
Averager 202, statistical calculator 203, and weights calculator 204 implement an improved EM algorithm similar to that suggested in the article by N. E. Day entitled "Estimating the Components of a Mixture of Normal Distributions", Biometrika, Vol. 56, No. 3, pp. 463-474, 1969. Utilizing the concept of a decaying average, classifier averager 202 calculates the average for the classifiers for the present and previous frames by calculating the following equations 1, 2, and 3:

n = n + 1 if n < 2000 (1)

z = 1/n (2)

Xn = (1 - z) Xn-1 + z xn (3)

xn is a vector representing the classifiers for the present frame, and n is the number of frames that have been processed, up to 2000. z represents the decaying average coefficient, and Xn represents the average of the classifiers over the present and past frames. Statistical calculator 203 is responsive to receipt of the z, xn, and Xn information to calculate the covariance matrix, T, by first calculating the matrix of sums of squares and products, Qn, as follows:

Qn = (1 - z) Qn-1 + z xn xn' (4)

After Qn has been calculated, T is calculated as follows:

T = Qn - Xn Xn' (5)

The means are subtracted from the classifiers as follows:

xn = xn - Xn (6)

Next, calculator 203 determines the probability that the frame represented by the present vector xn is unvoiced by solving equation 7 shown below where, advantageously, the components of vector a are initialized as follows: the component corresponding to the log of the speech energy equals 0.3918606, the component corresponding to the log of the LPC gain equals -0.0520902, the component corresponding to the log area ratio of the first reflection coefficient equals 0.5637082, and the component corresponding to the squared correlation coefficient equals 1.361249;
and b initially equals -8.36454:

P(u | xn) = 1 / (1 + e^(a'xn + b)) (7)

After solving equation 7, calculator 203 determines the probability that the classifiers represent a voiced frame by solving the following:

P(v | xn) = 1 - P(u | xn) (8)

Next, calculator 203 determines the overall probability that any frame will be unvoiced by solving equation 9 for pn:

pn = (1 - z) pn-1 + z P(u | xn) (9)

After determining the probability that a frame will be unvoiced, calculator 203 then determines two vectors, u and v, which give the mean values of each classifier for both unvoiced and voiced type frames. Vectors u and v are the statistical averages for unvoiced and voiced frames, respectively. Vector u, the statistical average unvoiced vector, contains the mean values of each classifier if a frame is unvoiced; and vector v, the statistical average voiced vector, gives the mean value for each classifier if a frame is voiced. Vector u for the present frame is solved by calculating equation 10, and vector v is determined for the present frame by calculating equation 11, as follows:

un = (1 - z) un-1 + z xn P(u|xn)/pn + z Xn (10)

vn = (1 - z) vn-1 + z xn P(v|xn)/(1 - pn) + z Xn (11)

Calculator 203 now communicates the u and v vectors, the T matrix, and probability pn to weights calculator 204 via path 212.
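The per-frame work of averager 202 and calculator 203 can be sketched as below. This is a simplified rendering of equations 1 through 11; the zero-initialized statistics and starting value pn = 0.5 are assumptions, while the initial a and b are the values quoted above:

```python
import numpy as np

class StatisticalCalculator:
    """Decaying-average statistics per speech frame (equations 1-11, sketch)."""

    def __init__(self, dim=4):
        self.n = 0
        self.X = np.zeros(dim)            # classifier average (eq. 3)
        self.Q = np.zeros((dim, dim))     # sums of squares/products (eq. 4)
        self.pn = 0.5                     # overall P(unvoiced) (eq. 9)
        self.u = np.zeros(dim)            # unvoiced class means (eq. 10)
        self.v = np.zeros(dim)            # voiced class means (eq. 11)
        self.a = np.array([0.3918606, -0.0520902, 0.5637082, 1.361249])
        self.b = -8.36454

    def update(self, x):
        if self.n < 2000:
            self.n += 1                                   # eq. 1
        z = 1.0 / self.n                                  # eq. 2
        self.X = (1 - z) * self.X + z * x                 # eq. 3
        self.Q = (1 - z) * self.Q + z * np.outer(x, x)    # eq. 4
        T = self.Q - np.outer(self.X, self.X)             # eq. 5
        xc = x - self.X                                   # eq. 6
        pu = 1.0 / (1.0 + np.exp(self.a @ xc + self.b))   # eq. 7
        pv = 1.0 - pu                                     # eq. 8
        self.pn = (1 - z) * self.pn + z * pu              # eq. 9
        self.u = (1 - z) * self.u + z * xc * pu / self.pn + z * self.X        # eq. 10
        self.v = (1 - z) * self.v + z * xc * pv / (1 - self.pn) + z * self.X  # eq. 11
        return T, xc, pu
```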

Weights calculator 204 is responsive to this information to calculate new values for vector a and scalar b. These new values are then transmitted back to statistical calculator 203 via path 213. This allows detector 103 to adapt rapidly to changing environments. Advantageously, if the new values for vector a and scalar b are not transmitted back to statistical calculator 203, detector 103 will continue to adapt to changing environments since vectors u and v are being updated. As will be seen, determinator 205 uses vectors u and v as well as vector a and scalar b to make the voicing decision. If n is greater than, advantageously, 99, vector a and scalar b are calculated as follows. Vector a is determined by solving the following equation:

a = T^-1 (un - vn) / [1 - pn (1 - pn) (un - vn)' T^-1 (un - vn)] (12)

b = -a'(un + vn)/2 + log[(1 - pn)/pn] (13)

After calculating equations 12 and 13, weights calculator 204 transmits vectors a, u, and v to block 205 via path 214. If the frame contained silence, only equation 6 is calculated.
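A direct transcription of equations 12 and 13, reading equation 12 in Day's total-covariance form as rendered above (solving a linear system instead of forming T^-1 explicitly is an implementation choice, and the function name is illustrative):

```python
import numpy as np

def update_weights(T, u, v, pn):
    """Weights calculator 204 (equations 12 and 13, sketch): derive a new
    weight vector a and threshold b from covariance matrix T, the class
    mean vectors u and v, and the overall unvoiced probability pn."""
    d = u - v
    Td = np.linalg.solve(T, d)                          # T^-1 (un - vn)
    a = Td / (1.0 - pn * (1.0 - pn) * (d @ Td))         # eq. 12
    b = -0.5 * (a @ (u + v)) + np.log((1.0 - pn) / pn)  # eq. 13
    return a, b
```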
Determinator 205 is responsive to this transmitted information to decide whether the present frame is voiced or unvoiced. If the element of vector (vn - un) corresponding to power is positive, then a frame is declared voiced if the following equation is true:

a'xn - a'(un + vn)/2 > 0; (14)

or, if the element of vector (vn - un) corresponding to power is negative, then a frame is declared voiced if the following equation is true:

a'xn - a'(un + vn)/2 < 0. (15)

Equation 14 can also be rewritten as:

a'xn + b - log[(1 - pn)/pn] > 0.

Equation 15 can also be rewritten as:

a'xn + b - log[(1 - pn)/pn] < 0.

If the previous conditions are not met, determinator 205 declares the frame unvoiced. Equations 14 and 15 represent decision regions for making the voicing decision. The log term of the rewritten forms of equations 14 and 15 can be eliminated with some change of performance. Advantageously, in the present example, the element corresponding to power is the log of the speech energy.
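In code, this decision rule looks like the following sketch; the index of the power element is assumed to be that of the log-energy classifier:

```python
def uv_decision(a, xn, u, v, power_idx=0):
    """U/V determinator 205 (equations 14 and 15, sketch). xn is the
    classifier vector as used in equation 7; power_idx selects the
    element corresponding to the log of the speech energy."""
    score = a @ xn - a @ (u + v) / 2.0
    if v[power_idx] - u[power_idx] >= 0.0:
        return score > 0.0   # eq. 14: voiced if positive
    return score < 0.0       # eq. 15: voiced if negative
```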
Generator 206 is responsive to the information received via path 214 from calculator 204 to calculate the distance measure, A, as follows. First, the discriminant variable, d, is calculated by equation 16 as follows:

d = a'xn + b - log[(1 - pn)/pn] (16)

Advantageously, it would be obvious to one skilled in the art to use different types of voicing detectors to generate a value similar to d for use in the following equations. One such detector would be an auto-correlation detector. If the frame is voiced, equations 17 through 20 are solved as follows:

m1 = (1 - z) m1 + z d, (17)

s1 = (1 - z) s1 + z d^2, and (18)

k1 = s1 - m1^2 (19)

where m1 is the mean for voiced frames and k1 is the variance for voiced frames. The probability, Pd, that determinator 205 will declare a frame unvoiced is calculated by the following equation:

Pd = (1 - z) Pd (20)

Advantageously, Pd is initially set to 0.5.
If the frame is unvoiced, equations 21 through 24 are solved as follows:

m0 = (1 - z) m0 + z d, (21)

s0 = (1 - z) s0 + z d^2, and (22)

k0 = s0 - m0^2. (23)

The probability, Pd, that determinator 205 will declare a frame unvoiced is calculated by the following equation:

Pd = (1 - z) Pd + z (24)

After calculating equations 16 through 24, the distance measure or merit value is calculated as follows:

A^2 = Pd (1 - Pd)(m1 - m0)^2 / [(1 - Pd) k1 + Pd k0] (25)

Equation 25 uses Hotelling's two-sample T^2 statistic to calculate the distance measure. For equation 25, the larger the merit value, the greater the separation. However, other merit values exist where the smaller the merit value, the greater the separation. Advantageously, the distance measure can also be the Mahalanobis distance, which is given in the following equation:

A^2 = (m1 - m0)^2 / [(1 - Pd) k1 + Pd k0] (26)

Advantageously, a third technique is given in the following equation:

A^2 = (m1 - m0)^2 / (k1 + k0) (27)

Advantageously, a fourth technique for calculating the distance measure is illustrated in the following equation:

A^2 = a'(vn - un) (28)

Discriminant detector 102 makes the unvoiced/voiced decision by transmitting information to multiplexer 105 via path 107, indicating a voiced frame if a'x + b > 0. If this condition is not true, then detector 102 indicates an unvoiced frame. The values for vector a and scalar b used by detector 102 are advantageously identical to the initial values of a and b for statistical voiced detector 103.
Detector 102 determines the distance measure in a manner similar to generator 206 by performing calculations similar to those given in equations 16 through 28.
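The distance-measure bookkeeping of equations 16 through 25 can be sketched as follows. Zero-initialized means and variances are an assumption; Pd starts at 0.5 as stated above:

```python
import numpy as np

class DistanceGenerator:
    """Generator 206 (equations 16-25, sketch): track the discriminant
    variable d per class and form Hotelling's two-sample merit value."""

    def __init__(self):
        self.m = [0.0, 0.0]   # means, index 0 = unvoiced, 1 = voiced
        self.s = [0.0, 0.0]   # mean squares
        self.k = [0.0, 0.0]   # variances
        self.Pd = 0.5         # probability a frame is declared unvoiced

    def update(self, a, xn, b, pn, voiced, z):
        d = a @ xn + b - np.log((1.0 - pn) / pn)      # eq. 16
        i = 1 if voiced else 0
        self.m[i] = (1 - z) * self.m[i] + z * d       # eqs. 17 / 21
        self.s[i] = (1 - z) * self.s[i] + z * d * d   # eqs. 18 / 22
        self.k[i] = self.s[i] - self.m[i] ** 2        # eqs. 19 / 23
        self.Pd = (1 - z) * self.Pd + (0.0 if voiced else z)  # eqs. 20 / 24

    def merit(self):
        num = self.Pd * (1 - self.Pd) * (self.m[1] - self.m[0]) ** 2
        den = (1 - self.Pd) * self.k[1] + self.Pd * self.k[0]
        return num / den if den > 0.0 else 0.0        # eq. 25 (A^2)
```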
In flow chart form, FIGS. 3 and 4 illustrate, in greater detail, the operations performed by statistical voiced detector 103 of FIG. 2. Blocks 302 and 300 implement blocks 202 and 201 of FIG. 2, respectively. Blocks 304 through 318 implement statistical calculator 203. Blocks 320 and 322 implement weights calculator 204, and blocks 326 through 338 implement block 205 of FIG. 2. Generator 206 of FIG. 2 is implemented by block 340. Subtractor 207 is implemented by block 308 or block 324.
Block 302 calculates the vector which represents the average of the classifiers for the present frame and all previous frames. Block 300 determines whether speech or silence is present in the present frame; and if silence is present in the present frame, the mean for each classifier is subtracted from each classifier by block 324 before control is transferred to decision block 326. However, if speech is present in the present frame, then the statistical and weights calculations are performed by blocks 304 through 322. First, the average vector is found in block 302. Second, the sums of squares and products matrix is calculated in block 304. The latter matrix, along with the vector X representing the mean of the classifiers for the present and past frames, is then utilized to calculate the covariance matrix, T, in block 306. The mean X is then subtracted from the classifier vector xn in block 308.
Block 310 then calculates the probability that the present frame is unvoiced by utilizing the present weight vector a, the present threshold value b, and the classifier vector for the present frame, xn. After calculating the probability that the present frame is unvoiced, the probability that the present frame is voiced is calculated by block 312. Then, the overall probability, pn, that any frame will be unvoiced is calculated by block 314.
Blocks 316 and 318 calculate two vectors: u and v. The values contained in vector u represent the statistical average values that each classifier would have if the frame were unvoiced, whereas vector v contains values representing the statistical average values that each classifier would have if the frame were voiced. The actual vectors of classifiers for the present and previous frames are clustered around either vector u or vector v. The vectors representing the classifiers for the previous and present frames are clustered around vector u if these frames are found to be unvoiced; otherwise, the previous classifier vectors are clustered around vector v.
After execution of blocks 316 and 318, control is transferred to decision block 320. If N is greater than 99, control is transferred to block 322; otherwise, control is transferred to block 326. Upon receiving control, block 322 then calculates a new weight vector a and a new threshold value b. The vector a and value b are used in the next sequential frame by the preceding blocks in FIG. 3. Advantageously, if N is required to be greater than infinity, vector a and scalar b will never be changed, and detector 103 will adapt solely in response to vectors v and u, as illustrated in blocks 326 through 338.
Blocks 326 through 338 implement U/V determinator 205 of FIG. 2.
Block 326 determines whether the power term of vector v of the present frame is greater than or equal to the power term of vector u. If this condition is true, then decision block 328 is executed. The latter decision block determines whether the test for voiced or unvoiced is met. If the frame is found to be voiced in decision block 328, then the frame is so marked as voiced by block 330; otherwise the frame is marked as unvoiced by block 332. If the power term of vector v is less than the power term of vector u for the present frame, blocks 334 through 338 are executed and function in a similar manner. Finally, block 340 calculates the distance measure.
In flow chart form, FIG. 5 illustrates, in greater detail, the operations performed by block 340 of FIG. 4. Decision block 501 determines whether the frame has been indicated as unvoiced or voiced by examining the calculations performed by blocks 330, 332, 336, or 338. If the frame has been designated as voiced, path 507 is selected. Block 510 calculates probability Pd, block 502 recalculates the mean, m1, for the voiced frames, and block 503 recalculates the variance, k1, for voiced frames. If the frame was determined to be unvoiced, decision block 501 selects path 508. Block 509 recalculates probability Pd, block 504 recalculates the mean, m0, for unvoiced frames, and block 505 recalculates the variance, k0, for unvoiced frames. Finally, block 506 calculates the distance measure by performing the calculations indicated.
It is to be understood that the afore-described embodiment is merely illustrative of the principles of the invention and that other arrangements may be devised by those skilled in the art without departing from the spirit and the scope of the invention. In particular, the calculations performed per frame or set could be performed for a group of frames or sets.

Claims (10)

Claims:
1. An apparatus for determining the voicing decision for non-training set speech signals comprising:
means responsive to said non-training set speech signals for sampling said speech signals to produce digital speech signals, to form frames of said digital non-training set speech signals, and to process each frame to generate a set of classifiers defining speech attributes;
means for estimating statistical distributions for voiced and unvoiced frames without prior knowledge of the voicing decisions for past ones of said frames of digital non-training set speech;
means responsive to said statistical distributions for determining decision regions representing voiced and unvoiced digital non-training set speech;
means responsive to said decision regions and a present one of said frames for making the voicing decision; and means responsive to the determination of said voicing decision in said frame of said digital non-training set speech signals for transmitting a signal to a data unit for subsequent use in speech processing.
2. The apparatus of claim 1 wherein said estimating means comprises means responsive to said present and past ones of said frames for calculating the probability that said present one of said frames is voiced;
means responsive to said present and past ones of said frames for calculating the probability that said present one of said frames is unvoiced;
means responsive to said present and past ones of said frames and said probability that said present one of said frames is unvoiced for calculating the overall probability that any frame will be unvoiced;
means responsive to said probability that said present one of said frames is voiced and said overall probability for calculating the probability distribution of voiced ones of said frames; and means responsive to said probability that said present one of said frames is unvoiced and said overall probability for calculating the probability distribution of unvoiced ones of said frames.
3. The apparatus of claim 2 wherein said means for calculating said probability that said present one of said frames is unvoiced performs a maximum likelihood statistical operation.
4. The apparatus of claim 3 wherein said means for calculating said probability that said present one of said frames is unvoiced is further responsive to a weight vector and a threshold value to perform said maximum likelihood statistical operation.
5. The apparatus of claim 2 wherein said means for determining said decision regions comprises means responsive to said present and past ones of said frames for calculating covariance; and means responsive to said covariance for generating said decision region representing said unvoiced speech.
6. A method for determining the voicing decision for non-training set speech signals, comprising the steps of:
sampling said speech signal to produce digital non-training set speech signals, to form frames of said digital non-training set speech signals, and to process each frame to generate a set of classifiers defining speech attributes;
estimating statistical distributions for voiced and unvoiced frames without prior knowledge of the voicing decisions for previous ones of said frames of digital non-training set speech;
determining decision regions representing voiced and unvoiced speech in response to said statistical distributions; making the voicing decision in response to said decision regions and a present one of said frames; and transmitting a signal to a data unit for subsequent use in speech processing in response to the determination of said voicing decision in said frame of said digital non-training speech signals.
7. The method of claim 6 wherein said estimating step comprises the steps of calculating the probability that said present one of said frames is voiced in response to said present and past ones of said frames;
calculating the probability that said present one of said frames is unvoiced in response to said present and past ones of said frames of non-training set speech;
calculating the overall probability that any frame will be unvoiced in response to said present and past ones of said frames and said probability that said present one of said frames is unvoiced;
calculating the probability distribution of voiced ones of said frames in response to said probability that said present one of said frames is voiced and said overall probability; and calculating the probability distribution of unvoiced ones of said frames in response to said probability that said present one of said frames is unvoiced and said overall probability.
8. The method of claim 7 wherein said step of calculating said probability that said present one of said frames is unvoiced performs a maximum likelihood statistical operation.
9. The method of claim 8 wherein said step of calculating said probability that said present one of said frames is unvoiced is further responsive to a weight vector and a threshold value to perform said maximum likelihood statistical operation.
10. The method of claim 7 wherein said step of determining said decision regions is further responsive to said overall probability for determining said decision region representing said unvoiced speech.
CA000616983A 1987-04-03 1995-03-09 Adaptive multivariate estimating apparatus Expired - Fee Related CA1338251C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA000616983A CA1338251C (en) 1987-04-03 1995-03-09 Adaptive multivariate estimating apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US3429687A 1987-04-03 1987-04-03
US034,296 1987-04-03
CA000560109A CA1337708C (en) 1987-04-03 1988-02-29 Adaptive multivariate estimating apparatus
CA000616983A CA1338251C (en) 1987-04-03 1995-03-09 Adaptive multivariate estimating apparatus

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CA000560109A Division CA1337708C (en) 1987-04-03 1988-02-29 Adaptive multivariate estimating apparatus

Publications (1)

Publication Number Publication Date
CA1338251C true CA1338251C (en) 1996-04-16

Family

ID=21875521

Family Applications (2)

Application Number Title Priority Date Filing Date
CA000560109A Expired - Fee Related CA1337708C (en) 1987-04-03 1988-02-29 Adaptive multivariate estimating apparatus
CA000616983A Expired - Fee Related CA1338251C (en) 1987-04-03 1995-03-09 Adaptive multivariate estimating apparatus

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CA000560109A Expired - Fee Related CA1337708C (en) 1987-04-03 1988-02-29 Adaptive multivariate estimating apparatus

Country Status (9)

Country Link
EP (1) EP0308433B1 (en)
JP (1) JPH01502779A (en)
AT (1) ATE82426T1 (en)
AU (1) AU599459B2 (en)
CA (2) CA1337708C (en)
DE (1) DE3875894T2 (en)
HK (1) HK106693A (en)
SG (1) SG59893G (en)
WO (1) WO1988007738A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU598933B2 (en) * 1987-04-03 1990-07-05 American Telephone And Telegraph Company An adaptive threshold voiced detector
JP3277398B2 (en) * 1992-04-15 2002-04-22 ソニー株式会社 Voiced sound discrimination method
US6202046B1 (en) 1997-01-23 2001-03-13 Kabushiki Kaisha Toshiba Background noise/speech classification method
JP3670217B2 (en) 2000-09-06 2005-07-13 国立大学法人名古屋大学 Noise encoding device, noise decoding device, noise encoding method, and noise decoding method
JP4517045B2 (en) * 2005-04-01 2010-08-04 独立行政法人産業技術総合研究所 Pitch estimation method and apparatus, and pitch estimation program
CN104517614A (en) * 2013-09-30 2015-04-15 上海爱聊信息科技有限公司 Voiced/unvoiced decision device and method based on sub-band characteristic parameter values

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU598933B2 (en) * 1987-04-03 1990-07-05 American Telephone And Telegraph Company An adaptive threshold voiced detector

Also Published As

Publication number Publication date
DE3875894D1 (en) 1992-12-17
DE3875894T2 (en) 1993-05-19
ATE82426T1 (en) 1992-11-15
SG59893G (en) 1993-07-09
EP0308433B1 (en) 1992-11-11
AU599459B2 (en) 1990-07-19
AU1222688A (en) 1988-11-02
WO1988007738A1 (en) 1988-10-06
JPH0795237B1 (en) 1995-10-11
EP0308433A1 (en) 1989-03-29
HK106693A (en) 1993-10-15
CA1337708C (en) 1995-12-05
JPH01502779A (en) 1989-09-21

Similar Documents

Publication Publication Date Title
CA2165229C (en) Method and apparatus for characterizing an input signal
CA1123955A (en) Speech analysis and synthesis apparatus
Rix et al. Perceptual Evaluation of Speech Quality (PESQ) The New ITU Standard for End-to-End Speech Quality Assessment Part I--Time-Delay Compensation
EP0776567B1 (en) Analysis of audio quality
EP1083542B1 (en) A method and apparatus for speech detection
EP1160772A2 (en) Multisensor based acoustic signal processing
NZ228290A (en) Voice activity detector by spectrum comparison
US9786300B2 (en) Single-sided speech quality measurement
Enqing et al. Voice activity detection based on short-time energy and noise spectrum adaptation
JP2000099080A (en) Voice recognizing method using evaluation of reliability scale
US5046100A (en) Adaptive multivariate estimating apparatus
EP0685835B1 (en) Speech recognition based on HMMs
CA1338251C (en) Adaptive multivariate estimating apparatus
US5007093A (en) Adaptive threshold voiced detector
US4972490A (en) Distance measurement control of a multiple detector system
JP4673828B2 (en) Speech signal section estimation apparatus, method thereof, program thereof and recording medium
EP0309561B1 (en) An adaptive threshold voiced detector
CA1336212C (en) Distance measurement control of a multiple detector system
de Abreu et al. Regression-Based Noise Modeling for Speech Signal Processing
Cauchi NON-INTRUSIVE QUALITY EVALUATION OF SPEECH PROCESSED IN NOISY AND REVERBERANT ENVIRONMENTS
Bertocco et al. In-service nonintrusive measurement of noise and active speech level in telephone-type networks
Yamazaki et al. An objective method for evaluating the quality of speech with code errors using pattern matching techniques
Moulsley et al. An adaptive voiced/unvoiced speech classifier.
Grimaldi An improved procedure for QoS measurement in telecommunication systems
Kaleka Effectiveness of Linear Predictive Coding in Telephony based applications of Speech Recognition

Legal Events

Date Code Title Description
MKLA Lapsed