WO2001013360A1 - Pitch and voicing estimation for low bit rate speech coders - Google Patents

Publication number
WO2001013360A1
Authority
WO
WIPO (PCT)
Prior art keywords
peak
signal
separation distances
peak separation
voiced
Prior art date
Application number
PCT/CA2000/000364
Other languages
French (fr)
Inventor
Bhaskar Bhattacharya
Original Assignee
Glenayre Electronics, Inc.
Application filed by Glenayre Electronics, Inc.
Priority to AU36512/00A
Publication of WO2001013360A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G10L25/90 Pitch determination of speech signals

Definitions

  • the invention provides an improved method of estimating the speech sound pitch and voicing parameters used by low bit rate speech coders.
  • Low bit rate coders operate by processing estimates of speech sound pitch and voicing. These estimates should preferably be highly accurate, but prior art techniques have yielded relatively inaccurate pitch and voicing estimates. One reason for this is the fact that pitch changes constantly, making it difficult to reliably estimate pitch at any particular instant. Another reason is that voiced speech sound is not perfectly periodic, and the degree of aperiodicity varies both from sound to sound and from speaker to speaker.
  • the prior art commonly uses auto-correlation techniques to detect signal waveform similarities, such as the signal peaks which characterize voiced speech sounds.
  • the time (i.e. horizontal) axes of two identical copies of the waveform segment of interest are incrementally repositioned with respect to each other, and the two waveform segments are auto-correlated at each repositioning.
  • the auto-correlation value is 1.
  • the auto-correlation value is less than 1, because the two waveform segments do not precisely match one another in such other positions.
  • the two waveform segments should "almost" match one another when they are repositioned so as to offset corresponding (but aperiodic) signal peaks by an integer number of cycles.
  • the prior art approach involves detection of periodicity in auto-correlation maxima.
  • the present invention processes the time domain signal characteristics of voiced speech sounds to provide pitch and voicing estimates of improved accuracy. Aperiodic signal components are reduced by filtering, and the signal peaks which characterize voiced speech sounds are enhanced, improving the reliability with which the signal peaks can be detected.
  • the average distance between adjacent signal peaks within a "window" containing several such peaks can then be measured, in accordance with the invention, so as to determine an average pitch value with greater confidence.
  • the invention provides a method of transforming a speech signal segment s(n) into a signal r_xy(k) having a plurality of substantially equal magnitude peaks, with each adjacent pair of peaks separated by a distance corresponding to one pitch cycle of s(n) if s(n) is voiced. This is achieved by first filtering s(n) to remove high frequency signal components therefrom and to produce a filtered replica thereof. The filtered replica is then magnitude expanded (preferably, cubed) to produce a magnitude expanded and filtered signal x(n). The largest magnitude signal peak within x(n) is then located, and a template y(n) comprising a portion of x(n) containing the largest magnitude signal peak is derived. y(n) is then cross-correlated across x(n) to produce r_xy(k).
  • the invention also provides a method of estimating a speech sound voicing parameter v(m) and a speech sound pitch parameter p(m) characterizing a speech signal s(n). This is achieved by first filtering s(n) to remove high frequency signal components therefrom and to produce a filtered replica thereof. The filtered replica is then magnitude expanded (preferably, cubed) to produce a magnitude expanded and filtered signal x(n). x(n) is then transformed into a signal r_xy(k) having a plurality of substantially equal magnitude peaks with each adjacent pair of peaks separated by a distance corresponding to p(m) if s(n) is voiced.
  • r_xy(k) and a predefined peak detection threshold are compared to detect the aforementioned peaks, and the peak separation distance between each adjacent pair of detected peaks is determined. If the peak separation distances are substantially equal to one another, then p(m) is set equal to the average of the peak separation distances and v(m) is set to indicate that s(n) is voiced. If the peak separation distances are not substantially equal to one another, and if the peak detection threshold has not been decreased by a predefined amount, then the peak detection threshold is decreased and the method is repeated, commencing with the detection threshold comparison. If the peak separation distances are not substantially equal to one another, and if the peak detection threshold has been decreased by the predefined amount, then v(m) is set to indicate that s(n) is not voiced.
  • the transformation of x(n) may be performed by locating the largest magnitude signal peak within x(n), deriving a template y(n) comprising a portion of x(n) containing the largest magnitude signal peak, and then cross-correlating y(n) with x(n) to produce r_xy(k).
  • the comparison of r_xy(k) with the peak detection threshold may be performed by locating, within r_xy(k), each signal peak having a peak magnitude value exceeding a first peak threshold detection value.
  • the peak separation distance determination may be performed by determining a set of first peak separation distances between each adjacent pair of signal peaks having a peak magnitude value exceeding the first peak threshold detection value. If the first peak separation distances differ from one another by an amount less than or equal to a selected maximum value, p(m) is set equal to the average of the first peak separation distances and v(m) is set to indicate that s(n) is voiced.
  • the method is repeated, commencing with locating, within r_xy(k), each signal peak having a peak magnitude value exceeding a second peak threshold detection value, less than the first peak threshold detection value.
  • the peak separation distance determination may then include determining a set of second peak separation distances between each adjacent pair of signal peaks having a peak magnitude value exceeding the second peak threshold detection value. If the second peak separation distances differ from one another by an amount less than or equal to the selected maximum value, p(m) is set equal to the average of the second peak separation distances and v(m) is set to indicate that s(n) is voiced.
  • the threshold detection value is decreased and the method is repeated, commencing with locating, within r_xy(k), each signal peak having a peak magnitude value exceeding a third peak threshold detection value, less than the second peak threshold detection value.
  • the peak separation distance determination may then include determining a set of third peak separation distances between each adjacent pair of signal peaks having a peak magnitude value exceeding the third peak threshold detection value. If the third peak separation distances differ from one another by an amount less than or equal to the selected maximum value, p(m) is set equal to an average of the third peak separation distances and v(m) is set to indicate that s(n) is voiced. If the first, second and third peak separation distances differ from one another by an amount greater than the selected maximum value, v(m) is set to indicate that s(n) is not voiced.
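
The three-stage threshold search summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the threshold values (0.8, 0.6, 0.4) and the `max_spread` tolerance are assumptions chosen for the example, and a frame yielding only a single peak separation distance is treated as trivially consistent.

```python
def peak_positions(r, threshold):
    """Indices of local maxima in r whose value exceeds threshold."""
    return [k for k in range(1, len(r) - 1)
            if r[k] > threshold and r[k] >= r[k - 1] and r[k] > r[k + 1]]


def estimate_pitch_and_voicing(r, thresholds=(0.8, 0.6, 0.4), max_spread=5):
    """Return (p, v): average peak spacing and voicing flag (1 voiced, 0 unvoiced).

    thresholds and max_spread are hypothetical values, not taken from the source.
    """
    for threshold in thresholds:  # first, second, then third (decreasing) threshold
        peaks = peak_positions(r, threshold)
        distances = [b - a for a, b in zip(peaks, peaks[1:])]
        # "Substantially equal": the distances differ by no more than max_spread.
        if distances and max(distances) - min(distances) <= max_spread:
            return sum(distances) / len(distances), 1
    return 0.0, 0  # no threshold produced consistent spacing: unvoiced
```

Lowering the threshold only when the spacing is inconsistent lets the strongest peaks decide the pitch whenever they suffice, and brings in weaker peaks only when some pitch cycles were missed.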
  • the invention is also directed to an electronic signal in a low bit rate speech coder.
  • the signal comprises a filtered, magnitude expanded and transformed replica r_xy(k) of a speech signal s(n).
  • the replica has a plurality of substantially equal magnitude peaks, with each adjacent pair of peaks separated by a distance corresponding to one pitch cycle of s(n) if s(n) is voiced.
  • the replica is used by the speech coder to derive a speech sound voicing parameter v(m) characterizing s(n) and a speech sound pitch parameter p(m) characterizing s(n).
  • r_xy(k) is derived by cross-correlating a filtered and expanded replica x(n) of s(n) with a template y(n) comprising a portion of x(n) containing a largest magnitude signal peak of x(n).
  • Figure 1 is a signal waveform segment depicting 200 samples of voiced speech sampled at 8 kHz.
  • FIGS. 2A and 2B together comprise a flowchart illustrating the basic methodology of the invention.
  • Figure 3 is a graph depicting the frequency response characteristic of the low pass filter used in the preferred embodiment of the invention.
  • Figure 4A is a graph depicting an original voiced speech signal waveform segment, before processing in accordance with the present invention.
  • Figure 4B depicts the signal waveform segment of Figure 4A after low pass filtration and expanding in accordance with the invention.
  • Figure 4C depicts a normalized cross-correlation sequence obtained by cross-correlating the Figure 4B signal waveform segment with a largest-peak-containing portion thereof in accordance with the invention.
  • Figures 5A through 5L are flowcharts which collectively embody a method of estimating the speech sound pitch and voicing parameters in accordance with the present invention.
  • processing begins at block 300 by initializing various parameters.
  • a speech frame containing speech samples is then retrieved (block 302).
  • the speech frame is preprocessed (block 304) by low pass filtration to remove high frequency signal components and by expanding (i.e. cubing) to enhance the peak portions of the signal.
  • the energy ratio of the preprocessed and original speech frame signals is then determined (block 306), and the maximum magnitude attained by the original speech frame signal is determined (block 308).
  • a broad initial classification based upon the energy ratio and maximum magnitude values is then made (block 310), to determine (block 312) whether the current frame appears to be voiced or unvoiced. If the current frame is classified as unvoiced (i.e.
  • the preprocessed signal is inverted (block 320), if necessary, to orient the signal with its largest peak positive-going. This simplifies peak location, aiding in determination of periodicity and thus pitch.
  • a portion of the preprocessed signal containing the aforementioned largest peak is extracted to form a template.
  • the template is cross-correlated (block 324) across the entire frame to yield a signal (Figure 4C) which is expected to have peaks of substantially equal magnitude. Processing then continues at point "a" (Figure 2B).
  • processing continues (block 326) by detecting peaks within the cross-correlated signal, using three separate peak detection thresholds. The distances between adjacent pairs of peaks detected in block 326 are determined (block 328). These distances are representative of the speech frame's pitch period.
  • a test (block 330) is performed to determine whether the candidate pitch period values determined in block 328 are approximately equal to one another. If the answer is "yes" ('Y'), then the frame's pitch period value is set (block 332) equal to the average of the candidate pitch period values determined in block 328, a pitch value confidence flag is set to indicate a high degree of confidence in the pitch period value so determined, and the frame's voicing value variable is set to reflect the fact that the frame is voiced.
  • the pitch value is checked to determine whether it reflects multiple or sub-multiple pitch values, and any such aberrations are removed if detected. Processing then continues at point "b" (Figure 2A) as previously explained.
  • if the block 330 test is answered "no" ('N'), then a further test (block 336) is performed to determine whether some peaks may not have been detected by the aforementioned block 326 processing. If the block 336 test is answered "yes" ('Y'), then the peak detection threshold(s) are lowered (or other peak detection criteria are relaxed, as hereinafter explained) and processing continues at point "a" as previously explained. If the block 336 test is answered "no" ('N'), then a further test (block 338) is performed to determine whether the frame's signal energy is changing relatively quickly.
  • a voicing value v(m) = 1 corresponds to voiced speech, for which the estimated pitch value p(m) is meaningful.
  • a test is performed to determine whether a speech frame containing speech samples is available for processing. If the answer is "no" ('N'), processing stops (block 14). If the answer is "yes" ('Y'), frame counter m is incremented by one and the special voicing and pitch value confidence flags are each initialized to zero (block 16). Abnormalities such as croaking by the speaker may cause signal aberrations such as peak-to-peak interval spacings which exceed those characterizing the speaker's normal voiced speech pitch range.
  • the input speech signal segment s(n) (i.e. a "speech frame", for example, a signal consisting of speech sampled at 8 kHz, as depicted in Figure 4A) is low pass filtered to remove high frequency signal components.
  • Figure 3 depicts the frequency response characteristic of a suitable low pass filter having a cutoff frequency of 500 Hz ("LPF" in block 18).
  • LPF cutoff frequency
  • the low pass filtered signal is then expanded by cubing it ("(·)^3" in block 18) to enhance (i.e. amplify) the peak portions of the signal, relative to the non-peak signal portions, as seen in Figure 4B.
  • Squaring the low pass filtered signal would adequately enhance the signal peaks, but cubing preserves the negative-going signal portions, and is therefore preferred.
  • the low pass filtered, cubed signal is designated x(n), where n is the sample number.
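
The preprocessing step (low pass filtering at 500 Hz followed by cubing) might look like the sketch below. The source gives only the cutoff frequency and the 8 kHz sampling rate; the Hamming-windowed sinc design, the 65-tap length, and the function names `lowpass_fir` and `preprocess` are assumptions made for illustration.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=65):
    """Hamming-windowed sinc low-pass coefficients with unity gain at DC.

    The filter design and tap count are assumptions; the source only shows
    a frequency response with a 500 Hz cutoff.
    """
    fc = cutoff_hz / sample_rate_hz          # normalized cutoff (cycles per sample)
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        t = i - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [c / gain for c in taps]


def preprocess(s, cutoff_hz=500.0, sample_rate_hz=8000.0, num_taps=65):
    """Low-pass filter s(n), then cube each sample to obtain x(n)."""
    taps = lowpass_fir(cutoff_hz, sample_rate_hz, num_taps)
    half = num_taps // 2
    x = []
    for n in range(len(s)):
        acc = 0.0
        for i, c in enumerate(taps):
            j = n + half - i                 # centered convolution (no net delay)
            if 0 <= j < len(s):
                acc += c * s[j]
        x.append(acc ** 3)                   # cubing keeps the sign of each sample
    return x
```

Because an odd power preserves sign, negative-going peaks remain negative in x(n), which is the property the text cites as the reason for preferring cubing over squaring.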
  • the invention estimates an average value of pitch at a particular time instant by defining a "window" (hereafter “frame") centred at that time instant. All signal peaks of complete signal cycles included within the frame are examined so as to identify those cycles. The interval length of each such cycle is determined. The average interval length for all complete signal cycles included within the frame is then determined. The average interval length value so determined is the average pitch estimate.
  • the energy ratio e_r of the two signals s(n) and x(n) is then determined as indicated in block 20.
  • the value ⁇ is arbitrarily small (i.e.
  • the maximum magnitude s_max attained by the speech signal s(n) throughout the frame is then determined at block 22.
  • the absolute value of s(n) is used to make this determination, because s(n) may attain its maximum magnitude while negative.
  • higher e_r values are generally characteristic of voiced speech
  • lower e_r values are generally characteristic of unvoiced speech
  • higher s_max values are generally characteristic of voiced speech
  • lower s_max values are generally characteristic of unvoiced speech.
  • the current frame is also tentatively characterized (block 24) as unvoiced if s_max is less than a predefined constant value MAX_UVLEVEL characteristic of unvoiced speech and e_r is less than a predefined constant value ERATIOMIN_V characteristic of unvoiced speech. Otherwise, the current frame is tentatively characterized as voiced, and processing continues at point "B" (Figure 5B). If the current frame is tentatively characterized as unvoiced, as explained above, then processing continues at point "C" (Figure 5I).
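
The tentative classification at block 24 can be illustrated with a short sketch. The exact energy ratio formula of block 20 is not reproduced in this text, so the ratio below (energy of x(n) divided by energy of s(n), with a small constant guarding against division by zero) is an assumption, as are the function names; the values of MAX_UVLEVEL and ERATIOMIN_V are not given and are therefore passed in as parameters.

```python
def energy_ratio(s, x, eps=1e-12):
    """Assumed form of e_r: energy of x(n) relative to energy of s(n).

    eps is a small constant that avoids division by zero for silent frames.
    """
    return sum(v * v for v in x) / (sum(v * v for v in s) + eps)


def tentatively_unvoiced(s, x, max_uvlevel, eratiomin_v):
    """True if both s_max and e_r fall below their unvoiced-speech constants."""
    s_max = max(abs(v) for v in s)           # absolute value: the peak may be negative
    e_r = energy_ratio(s, x)
    return s_max < max_uvlevel and e_r < eratiomin_v
```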
  • processing at block 24 results in the current frame being tentatively characterized as voiced. Processing accordingly continues at point "B" (block 28, Figure 5B) by testing to determine whether the value of x(n) with the largest magnitude is positive (greater than zero). If the answer is "no" ('N'), then x(n) is inverted in block 30. The object is to orient x(n) so that its largest peak is positive-going. This simplifies location of such peak, which aids in determining periodicity and thus pitch.
  • y(n) serves as a template in the cross-correlation performed in block 34. Specifically, y(n) is cross-correlated across the entire frame to yield r_xy(k), which contains a plurality of substantially equal magnitude peaks, a representative example of which is depicted in Figure 4C together with three predefined peak threshold values PEAK_THRESH1, PEAK_THRESH2, and PEAK_THRESH3, which are employed as hereinafter explained.
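
A rough sketch of the orientation, template extraction and cross-correlation steps follows. The template width (`half_width`) and the per-segment normalization are assumptions; the text states only that the template is a portion of x(n) containing the largest peak and that Figure 4C shows a normalized cross-correlation sequence.

```python
import math


def orient(x):
    """Invert x(n) if its largest-magnitude sample is negative."""
    peak = max(x, key=abs)
    return [-v for v in x] if peak < 0 else list(x)


def template_cross_correlation(x, half_width=10):
    """Normalized cross-correlation r_xy(k) of x(n) with a template y(n)
    cut from around the largest peak of x(n).

    half_width and the normalization scheme are hypothetical choices.
    """
    x = orient(x)
    center = max(range(len(x)), key=lambda n: x[n])
    lo, hi = max(0, center - half_width), min(len(x), center + half_width + 1)
    y = x[lo:hi]                                   # template y(n)
    y_norm = math.sqrt(sum(v * v for v in y))
    r = [0.0] * len(x)
    for k in range(len(x) - len(y) + 1):
        seg = x[k:k + len(y)]
        seg_norm = math.sqrt(sum(v * v for v in seg))
        if seg_norm > 0 and y_norm > 0:
            # Store the result at the position aligned with the template's peak.
            r[k + (center - lo)] = sum(a * b for a, b in zip(seg, y)) / (seg_norm * y_norm)
    return r
```

Normalizing each segment by its own energy makes pulses of the same shape but different amplitude produce peaks of equal height, which is one way to obtain the "substantially equal magnitude peaks" the text describes.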
  • the value n_p1 is assigned a value equal to the number of signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1.
  • Multiple peak-to-peak intervals with peaks exceeding PEAK_THRESH1 facilitate reliable determination of signal period, and hence pitch. If similar interval widths can be derived for a suitable number of adjacent intervals then the average width of such intervals can be accepted as the pitch estimate with reasonably high confidence in the accuracy of such estimate.
  • i_p1 is assigned (block 36) a vector value equal to the positions of those signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1.
  • a test is performed to determine whether n_p1 < 2. If n_p1 < 2, then r_xy(k) does not contain at least two peaks having a magnitude exceeding PEAK_THRESH1, making it impossible to determine any peak-to-peak interval width for r_xy(k). In such case, the current frame is characterized as unvoiced (block 40) by zeroing the voicing value v(m) and the pitch value p(m), and processing then continues at point "D" (Figure 5G), as hereinafter explained.
  • r_xy(k) contains at least two peaks having a magnitude exceeding PEAK_THRESH1, facilitating determination of peak-to-peak interval width(s) for r_xy(k), as indicated in block 42.
  • the peak-to-peak interval width(s) p1(k) are determined for all adjacent signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1.
  • the variations Δp1(l) between those interval width(s) are determined for all such adjacent signal peaks in r_xy(k).
  • Pitch values which characterize normal human speech can vary widely.
  • the present invention is directed to low bit rate speech coders, which do not require accurate determination of particularly high or low pitch values, since such values do not significantly affect the speech coding quality of such coders.
  • it is relatively difficult to accurately determine particularly high or low pitch values. Accordingly, when processing continues at point "D" (Figure 5G), a test is performed (block 54) to determine whether the pitch value p(m) is particularly high or low (i.e. exceeds the predefined constant MAXP or is exceeded by the predefined constant MINP).
  • a further test (block 62) is performed to determine whether the current frame is characterized as unvoiced (like point "D", point "C" can be reached by following a number of different paths along which the current frame may have been characterized as unvoiced prior to reaching point "C"). If the answer to this further test is "no" ('N'), then processing continues at point "L" (Figure 5K), as hereinafter explained. If the answer to this further test is "yes" ('Y'), then the current values of the variables L_track and p_old are saved (block 64) in their corresponding previous-value variables.
  • pitch doubling is characterized by a reduction in the magnitude of every other peak in the speaker's speech sounds. This can result in incorrect determination of the pitch of such speech sounds as double the correct pitch value.
  • every other peak may be excluded from the peaks used to determine peak-to-peak interval length (and hence pitch) if the magnitude of every other peak does not exceed the threshold value used to identify the peaks.
  • pitch halving can result in incorrect determination of pitch as one-half the correct value
  • pitch thirding can result in incorrect determination of pitch as one-third the correct value
  • pitch quartering can result in incorrect determination of pitch as one-quarter the correct value.
  • a pitch quartering test (block 74) is performed to determine whether the absolute value of the past value of voiced pitch p_past less four times the pitch value p(m) is less than or equal to the value of the variable DPMAX2. If the answer is "yes" ('Y'), then it is concluded that the current frame is characterized by the pitch quartering phenomenon and the phenomenon's effect is removed by quadrupling the pitch value p(m) (block 76). Processing then continues at point "N" (Figure 5J), as hereinafter explained.
  • a pitch thirding test (block 78) is performed to determine whether the absolute value of the past value of voiced pitch p_past less three times the pitch value p(m) is less than or equal to the value of the variable DPMAX2. If the answer is "yes" ('Y'), then it is concluded that the current frame is characterized by the pitch thirding phenomenon and the phenomenon's effect is removed by trebling the pitch value p(m) (block 80). Processing then continues at point "N" (Figure 5J), as hereinafter explained.
  • if the pitch thirding test (block 78) answer is "no" ('N'), processing continues at point "M" (Figure 5J) with a pitch halving test (block 82) to determine whether the absolute value of the past value of voiced pitch p_past less twice the pitch value p(m) is less than or equal to the value of the variable DPMAX2. If the answer is "yes" ('Y'), then it is concluded that the current frame is characterized by the pitch halving phenomenon and the phenomenon's effect is removed by doubling the pitch value p(m) (block 84).
  • if the pitch halving test (block 82) answer is "no" ('N'), or after doubling of the pitch value p(m) (block 84), processing continues at point "N", with a pitch doubling test (block 86) to determine whether the pitch value exceeds 120, whether the absolute value of the past value of voiced pitch p_past less half the pitch value p(m) is less than or equal to the value of the variable DPMAX2, and whether the pitch value confidence flag is not set. If the answer is "yes" ('Y') in all three cases, then it is concluded that the current frame is characterized by the pitch doubling phenomenon and the phenomenon's effect is removed by halving the pitch value p(m) (block 88).
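
The chain of quartering, thirding, halving and doubling tests can be condensed into a short sketch. It assumes that the thirding test is reached only when the quartering test fails, which the flow described above suggests but does not state outright; the function name and argument names are hypothetical, and the value of DPMAX2 is not given in this text and is passed in as a parameter.

```python
def correct_pitch_multiples(p, p_past, dpmax2, confident):
    """Apply the quartering, thirding, halving and doubling tests to pitch p.

    p_past is the past voiced pitch; confident mirrors the pitch value
    confidence flag. Assumed ordering: quartering, else thirding, else halving.
    """
    if abs(p_past - 4 * p) <= dpmax2:        # pitch quartering (blocks 74, 76)
        p = 4 * p
    elif abs(p_past - 3 * p) <= dpmax2:      # pitch thirding (blocks 78, 80)
        p = 3 * p
    elif abs(p_past - 2 * p) <= dpmax2:      # pitch halving (blocks 82, 84)
        p = 2 * p
    # Point "N": pitch doubling test (blocks 86, 88); all three conditions must hold.
    if p > 120 and abs(p_past - p / 2) <= dpmax2 and not confident:
        p = p / 2
    return p
```

For example, with a past voiced pitch of 80 samples and a tolerance of 3, a current estimate of 20 is restored to 80 by the quartering branch, while an estimate of 130 against a past pitch of 66 is halved to 65 only when the confidence flag is not set.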
  • a further test is performed to determine whether the value of L_track (the current length of the unbroken sequence of voiced frames) is greater than or equal to a predefined constant MINTRLEN. This constant fixes at 3 the number of voiced frames which must occur in unbroken sequence before the variable p_old is updated. If the block 90 test answer is "yes" ('Y'), then p_old is updated (block 92) by assigning p_old a value equal to the average of the pitch values determined for the current frame and the immediately preceding two frames.
  • the current value of the variable p v _ is stored in the variable p v _ (block 94).
  • a test (block 96) is then performed to determine whether the pitch value p(m) is less than or equal to the value of the predefined maximum quantized pitch constant MAXP2 (which is initialized at 147). Pitch values exceeding 147 are rare, so pitch values determined to exceed 147 are of questionable reliability. This is recognized by bypassing block 98, in which the pitch value p(m) is stored in the variable p v _, if the block 96 test reveals a pitch value exceeding 147. Processing then continues at point "L" (Figure 5K).
  • a test is performed (block 100) to determine whether the voicing transient flag has been set.
  • the objective of the above-described processing in blocks 100-108 is to set or clear the voicing transient flag to facilitate correction of v( ) if a transient occurrence is detected, such as a single voiced frame occurring in the midst of a series of unvoiced frames. Whenever a voicing transition occurs (i.e. from voiced to unvoiced, or vice versa), the voicing transient flag is set to reflect such change and denote a possible transient occurrence. If the voicing transient flag is already set when processing reaches block 100, and if the current and immediately preceding frames have the same voicing classification (i.e. both voiced, or both unvoiced), then it is concluded that a valid (i.e.
  • the voicing transient flag is cleared (block 104). But, if the voicing transient flag is not set when processing reaches block 100, and if the current and immediately preceding frames have different voicing classifications, then it is concluded that a new and possibly transient voicing transition has occurred; hence the voicing transient flag is set (block 108).
  • a test is performed (block 110) to determine whether the current frame is characterized as unvoiced. If the answer is "no" ('N'), then processing continues at point "O" (Figure 5L), as hereinafter explained. If the answer is "yes" ('Y'), then the variables p_old and L_track are re-initialized to zero (block 112). Processing then continues at point "O" (Figure 5L), as will now be explained.
  • variable DPMAX determines the maximum allowable pitch variation between successive cycles.
  • L t is set to either 50, or twice the integer part of 30% of the previously determined voiced pitch value p v _, whichever is greater.
  • L frame is then set to the value of the parameter BASELEN plus the updated value of L r
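
The frame length update described in the two preceding items amounts to a small piece of arithmetic, sketched here. The function and argument names are hypothetical, and the value of BASELEN is not given in this text, so it is passed in as a parameter.

```python
def frame_length(p_v_prev, baselen):
    """Frame length: BASELEN plus an extension derived from the previous voiced pitch.

    The extension is the greater of 50 and twice the integer part of 30% of
    the previously determined voiced pitch value.
    """
    extension = max(50, 2 * int(0.3 * p_v_prev))
    return baselen + extension
```

With a hypothetical BASELEN of 160, a previous pitch of 40 leaves the extension at its floor of 50 (frame length 210), whereas a previous pitch of 95 gives an extension of 56 (frame length 216).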
  • i_p2 is assigned (block 134) a vector value equal to the positions of those signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH2.
  • the peak-to-peak interval widths, p2(k) are determined (block 136) for all adjacent peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH2.
  • r_xy(k) contains more than two peaks having a magnitude exceeding PEAK_THRESH2.
  • any one of the detected intervals is a pitch period: (i) all of the detected intervals are of approximately identical width (i.e. the block 140 test outcome is "Yes"), in which case the width of each detected interval is a pitch period; or, (ii) some peaks remain undetected because they do not exceed PEAK_THRESH2, in which case some of the interval widths are equal to multiples of actual pitch periods within some small variation.
  • the block 150 test detects the latter possibility.
  • processing at block 46 reveals that the maximum variation in interval width Δp1 is not less than or equal to the maximum allowable pitch variation between successive cycles (i.e. DPMAX).
  • Processing accordingly continues at block 156 by assigning the parameter n_p3 a value equal to the number of signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH3.
  • i_p3 is assigned (block 156) a vector value equal to the positions of those signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH3.
  • if the test performed in block 160 is answered "no" ('N'), or if the test performed in block 162 reveals that the positions of the peaks in r_xy(k) whose magnitude exceeds PEAK_THRESH1 do not coincide with the positions of the peaks in r_xy(k) whose magnitude exceeds PEAK_THRESH3, then the maximum and minimum interval width values max(p1), min(p1) are saved as p1_max, p1_min respectively (block 166).
  • a "no" answer to the block 160 or 162 tests implies that some peaks detected using PEAK_THRESH3 were not detected using PEAK_THRESH1. In such case, the largest interval detected using PEAK_THRESH1 may comprise multiple pitch periods.
  • a test is accordingly performed (block 168) to determine whether any sub-multiple of the maximum interval width value less the minimum interval width value, is less than or equal to DPMAX2. If the answer is "no" ('N'), then the largest interval detected using PEAK_THRESH1 does not consist of multiple pitch periods, and the frame is characterized as unvoiced (block 170) by zeroing the voicing value v(m) and the pitch value p(m). If the answer is "yes" ('Y') then the largest interval detected using PEAK_THRESH1 most probably does consist of multiple pitch periods, and the frame is characterized as voiced with pitch value p(m) equal to the weighted average of the largest and the smallest intervals, p1_max and p1_min. Processing then continues at point "D" (Figure 5G), as previously explained.
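
The sub-multiple test of block 168 can be sketched as below. The weighting used for the resulting pitch is not specified in this text; the sketch assumes that when the largest interval spans k pitch periods, the two intervals together span k + 1 periods, so the pitch is taken as (p1_max + p1_min) / (k + 1). The function name, the `max_multiple` search limit and that weighting are all assumptions.

```python
def submultiple_pitch(p_max, p_min, dpmax2, max_multiple=4):
    """Return (p, v) from the largest and smallest interval widths.

    Checks whether p_max / k is within dpmax2 of p_min for some integer k >= 2.
    max_multiple and the weighting of the average are hypothetical.
    """
    for k in range(2, max_multiple + 1):
        if abs(p_max / k - p_min) <= dpmax2:
            # Assumed weighting: k + 1 pitch periods span p_max + p_min samples.
            return (p_max + p_min) / (k + 1), 1
    return 0.0, 0  # no sub-multiple matches: frame characterized as unvoiced
```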
  • n_p3 is assigned a value equal to the number of signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH3.
  • n_p3 = 6 for the example shown in Figure 4C.
  • i_p3 is assigned (block 182) a vector value equal to the positions of those signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH3.
  • the peak-to-peak interval width(s) p3(k) are determined (block 184) for all adjacent signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH3. Then, the variations Δp3(l) between those interval width(s) are determined for all such adjacent signal peaks in r_xy(k).
  • a test is performed (block 186) to determine whether the number of signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH3, less the number of signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1 exceeds 1 (i.e. n_p3 - n_p1 > 1). If the answer is "no" ('N'), then processing continues at point "H" (Figure 5E), as hereinafter explained.
  • if the block 188 test determines that the maximum variation in interval width Δp3 is not less than or equal to the maximum allowable pitch variation between successive cycles, then the pitch values determined in respect of all signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH3 are examined (block 192) to identify any case in which the absolute value of the difference between any two such pitch values is less than the value of the variable DPMAX.
  • a test (block 194) is then performed to determine whether any such case has been identified. If the answer is "no" ('N'), then the current frame is characterized (block 196) as unvoiced by zeroing the voicing value v(m) and the pitch value p(m) and then continuing processing at point "D" (Figure 5G), as previously explained.
  • a ratio variable is assigned (block 198) the ratio of the pitch value defined by the peak-to-peak interval width between the first two adjacent signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1, to the average of the two pitch values identified in block 192.
  • Peaks exceeding the PEAK_THRESH3 threshold may not be reliable indicators of pitch since the PEAK_THRESH3 threshold is relatively low. Further testing is required to verify that large peak-to-peak intervals identified via the PEAK_THRESH3 threshold are reliable indicators of pitch.
  • a test is accordingly performed (block 212) to determine whether any sub-multiple of the maximum interval width value less the minimum interval width value, is less than or equal to DPMAX2. If the answer is "no" ('N'), the intervals identified via the PEAK_THRESH3 threshold are not pitch intervals.
  • the current frame is therefore characterized as unvoiced (block 214) by zeroing the voicing value v(m) and the pitch value p(m). Processing then continues at point "D" (Figure 5G), as previously explained.
  • the large interval identified via the PEAK_THRESH3 threshold most probably is a pitch multiple.
  • the pitch value confidence flag f_sp is set (block 220) to indicate a high degree of confidence in the reliability of the pitch value assigned in block 216. Processing then continues at point "D" (Figure 5G), as previously explained.
  • processing at block 58 reveals that the current frame has already been characterized as unvoiced.
  • a test (block 224) is then performed to determine whether one or more signal peaks in r_xy(k) have a magnitude exceeding PEAK_THRESH1 (i.e. n_p1 > 0). If the answer is "no" ('N'), then processing continues at point "J" (Figure 5H), as hereinafter explained.
  • a further test is then performed to determine whether the pitch value p(m) is outside the allowable pitch value range defined by the MINP and MAXP parameters. If the answer is "no" ('N'), then processing continues at block 244, as hereinafter explained. If the answer is "yes" ('Y'), then such out-of-range pitch values are ignored by characterizing the current frame as unvoiced (block 242), by zeroing the voicing value v(m) and the pitch value p(m). Processing then continues at block 244 (which is also reached when processing continues at point "J", as previously mentioned) by determining the maximum value s_max attained by the speech signal s(k) within two sub-frames centred on the current frame.
  • a test (block 246) is performed to determine whether s_max exceeds the maximum allowable signal magnitude for unvoiced sounds (MAX_UVLEVEL) and is also lower than the maximum allowable signal magnitude for voiced sounds (MAX_VLEVEL). If the answer is "yes" ('Y'), a further test (block 248) is performed to determine whether more than one peak in r_xy(k) exceeds PEAK_THRESH1 in magnitude.
  • a still further test is made to determine whether the absolute value of the current frame's pitch value (i.e. p(m)) less that of the frame which precedes the immediately preceding frame (i.e. p(m-2)) is less than 1.5 times the value of the variable DPMAX. If the block 264 test is answered "no" ('N'), then processing continues at point "O" (Figure 5L), as previously explained. If the block 264 test is answered "yes" ('Y'), then the immediately preceding frame is re-characterized as voiced (block 266) by setting its voicing value (i.e.
  • a test (block 270) is then performed to determine whether the variable p_old (which represents the average pitch value for an unbroken sequence of voiced frames) exceeds its initial value of zero. If the answer is "yes" ('Y'), then the pitch value p(m-1) of the immediately preceding frame is reset (block 272) to the value of p_old.

Abstract

A speech signal segment s(n) is filtered (304) to remove high frequency signal components. The corresponding filtered replica is magnitude expanded (304) to produce a signal x(n). The largest magnitude signal peak within x(n) is located (308), and a template y(n) comprising a portion of x(n) containing the largest magnitude signal peak is derived (322). y(n) is cross-correlated (324) across x(n) to produce r_xy(k). The invention facilitates estimation of speech sound voicing and pitch parameters v(m), p(m) which characterize s(n). r_xy(k) can be compared (326) with a threshold value to detect the peaks, and the separation distance between each adjacent pair of detected peaks determined (328). If the peak separation distances are substantially equal, p(m) is set equal to the average peak separation distance (322) and v(m) is set to indicate s(n) is voiced. Else if the peak detection threshold has not decreased by a predefined amount, then the peak detection threshold is decreased and the method is repeated.

Description

PITCH AND VOICING ESTIMATION FOR LOW BIT RATE SPEECH CODERS
Technical Field
The invention provides an improved method of estimating the speech sound pitch and voicing parameters used by low bit rate speech coders.
Background
Voiced speech sounds (sounds produced when the vocal cords vibrate, e.g. all vowel sounds) are almost periodic in nature, as illustrated in Figure 1. The period of such sounds, at any particular instant, is called the pitch.
Low bit rate coders operate by processing estimates of speech sound pitch and voicing. These estimates should preferably be highly accurate, but prior art techniques have yielded relatively inaccurate pitch and voicing estimates. One reason for this is the fact that pitch changes constantly, making it difficult to reliably estimate pitch at any particular instant. Another reason is that voiced speech sound is not perfectly periodic, and the degree of aperiodicity varies both from sound to sound and from speaker to speaker.
The prior art commonly uses auto-correlation techniques to detect signal waveform similarities, such as the signal peaks which characterize voiced speech sounds. For example, the time (i.e. horizontal) axes of two identical copies of the waveform segment of interest are incrementally repositioned with respect to each other, and the two waveform segments are auto-correlated at each repositioning. When the two copies are repositioned such that they precisely match (i.e. overlap) one another the auto-correlation value is 1. At all other positions the auto-correlation value is less than 1, because the two waveform segments do not precisely match one another in such other positions. However, the two waveform segments should "almost" match one another when they are repositioned so as to offset corresponding (but aperiodic) signal peaks by an integer number of cycles. Thus, the prior art approach involves detection of periodicity in auto-correlation maxima. This entails difficulties which are overcome by the present invention.
The present invention processes the time domain signal characteristics of voiced speech sounds to provide pitch and voicing estimates of improved accuracy. Aperiodic signal components are reduced by filtering, and the signal peaks which characterize voiced speech sounds are enhanced, improving the reliability with which the signal peaks can be detected. The average distance between adjacent signal peaks within a "window" containing several such peaks can then be measured, in accordance with the invention, so as to determine an average pitch value with greater confidence.
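For comparison, the prior art auto-correlation approach described above can be sketched roughly as follows. This is only an illustrative sketch: the function name and the normalization by total signal energy are assumptions made here, not details taken from any particular prior art coder.

```python
from typing import List


def normalized_autocorrelation(s: List[float], max_lag: int) -> List[float]:
    """Correlate a segment with a shifted copy of itself for lags 0..max_lag.

    Normalizing by the segment energy (an assumption of this sketch) makes
    the zero-lag value exactly 1; other lags give smaller values.
    """
    energy = sum(v * v for v in s)
    if energy == 0:
        return [0.0] * (max_lag + 1)
    return [
        sum(s[n] * s[n + lag] for n in range(len(s) - lag)) / energy
        for lag in range(max_lag + 1)
    ]


# A perfectly periodic toy signal with a period of 4 samples.
signal = [0.0, 1.0, 0.0, -1.0] * 10
r = normalized_autocorrelation(signal, 8)
```

In this toy case the value at lag 4 (one full period) is close to, but below, the zero-lag value of 1, which is the "almost match" behaviour the passage describes.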
Summary of Invention
The invention provides a method of transforming a speech signal segment s(n) into a signal r_xy(k) having a plurality of substantially equal magnitude peaks, with each adjacent pair of peaks separated by a distance corresponding to one pitch cycle of s(n) if s(n) is voiced. This is achieved by first filtering s(n) to remove high frequency signal components therefrom and to produce a filtered replica thereof. The filtered replica is then magnitude expanded (preferably, cubed) to produce a magnitude expanded and filtered signal x(n). The largest magnitude signal peak within x(n) is then located, and a template y(n) comprising a portion of x(n) containing the largest magnitude signal peak is derived. y(n) is then cross-correlated across x(n) to produce r_xy(k).
The invention also provides a method of estimating a speech sound voicing parameter v(m) and a speech sound pitch parameter p(m) characterizing a speech signal s(n). This is achieved by first filtering s(n) to remove high frequency signal components therefrom and to produce a filtered replica thereof. The filtered replica is then magnitude expanded (preferably, cubed) to produce a magnitude expanded and filtered signal x(n). x(n) is then transformed into a signal r_xy(k) having a plurality of substantially equal magnitude peaks with each adjacent pair of peaks separated by a distance corresponding to p(m) if s(n) is voiced. r_xy(k) and a predefined peak detection threshold are compared to detect the aforementioned peaks, and the peak separation distance between each adjacent pair of detected peaks is determined. If the peak separation distances are substantially equal to one another, then p(m) is set equal to the average of the peak separation distances and v(m) is set to indicate that s(n) is voiced. If the peak separation distances are not substantially equal to one another, and if the peak detection threshold has not been decreased by a predefined amount, then the peak detection threshold is decreased and the method is repeated, commencing with the r_xy(k):peak detection threshold comparison. If the peak separation distances are not substantially equal to one another, and if the peak detection threshold has been decreased by the predefined amount, then v(m) is set to indicate that s(n) is not voiced.
The transformation of x(n) may be performed by locating the largest magnitude signal peak within x(n), deriving a template y(n) comprising a portion of x(n) containing the largest magnitude signal peak, and then cross-correlating y(n) with x(n) to produce r_xy(k).
The r_xy(k):peak detection threshold comparison may be performed by locating, within r_xy(k), each signal peak having a peak magnitude value exceeding a first peak threshold detection value. The peak separation distance determination may be performed by determining a set of first peak separation distances between each adjacent pair of signal peaks having a peak magnitude value exceeding the first peak threshold detection value. If the first peak separation distances differ from one another by an amount less than or equal to a selected maximum value, p(m) is set equal to the average of the first peak separation distances and v(m) is set to indicate that s(n) is voiced.
If the first peak separation distances differ from one another by an amount greater than the selected maximum value, the method is repeated, commencing with locating, within r_xy(k), each signal peak having a peak magnitude value exceeding a second peak threshold detection value, less than the first peak threshold detection value. The peak separation distance determination may then include determining a set of second peak separation distances between each adjacent pair of signal peaks having a peak magnitude value exceeding the second peak threshold detection value. If the second peak separation distances differ from one another by an amount less than or equal to the selected maximum value, p(m) is set equal to the average of the second peak separation distances and v(m) is set to indicate that s(n) is voiced. If the second peak separation distances differ from one another by an amount greater than the selected maximum value, the threshold detection value is decreased and the method is repeated, commencing with locating, within r_xy(k), each signal peak having a peak magnitude value exceeding a third peak threshold detection value, less than the second peak threshold detection value. The peak separation distance determination may then include determining a set of third peak separation distances between each adjacent pair of signal peaks having a peak magnitude value exceeding the third peak threshold detection value. If the third peak separation distances differ from one another by an amount less than or equal to the selected maximum value, p(m) is set equal to an average of the third peak separation distances and v(m) is set to indicate that s(n) is voiced. If the first, second and third peak separation distances differ from one another by an amount greater than the selected maximum value, v(m) is set to indicate that s(n) is not voiced.
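The three-threshold procedure summarized above might be sketched as follows. This is a simplified illustration rather than the flowchart of Figures 5A-5L: the helper names are hypothetical, a "peak" is assumed to be a simple local maximum, and "differ from one another" is assumed to mean the spread between the largest and smallest separation distance. The threshold values and the maximum spread are taken from the parameter table in the detailed discussion.

```python
from typing import List, Sequence, Tuple

PEAK_THRESH1 = 0.85
PEAK_THRESH3 = 0.6
THRESHOLDS = (PEAK_THRESH1, 0.9 * PEAK_THRESH1, PEAK_THRESH3)


def find_peaks(r: Sequence[float], threshold: float) -> List[int]:
    """Indices of local maxima of r whose value exceeds the threshold."""
    return [
        k for k in range(1, len(r) - 1)
        if r[k] > threshold and r[k - 1] < r[k] >= r[k + 1]
    ]


def estimate_voicing_and_pitch(
    r: Sequence[float],
    thresholds: Sequence[float] = THRESHOLDS,
    max_spread: float = 8,
) -> Tuple[int, float]:
    """Return (v, p): v=1 with the average peak separation, or (0, 0.0)."""
    for threshold in thresholds:  # progressively lower thresholds
        peaks = find_peaks(r, threshold)
        gaps = [b - a for a, b in zip(peaks, peaks[1:])]
        if gaps and max(gaps) - min(gaps) <= max_spread:
            return 1, sum(gaps) / len(gaps)
    return 0, 0.0  # no threshold produced consistent separations


# Toy r_xy(k): every peak is 47 samples apart, but some are too low
# to be seen until the third (lowest) threshold is applied.
r_toy = [0.0] * 260
for position, height in [(10, 1.0), (57, 0.7), (104, 0.9),
                         (151, 0.7), (198, 0.7), (245, 0.9)]:
    r_toy[position] = height
result = estimate_voicing_and_pitch(r_toy)
```

With the first two thresholds only three peaks are found and their separations (94 and 141) are inconsistent, so the sketch falls through to the third threshold, where all six peaks give a uniform separation of 47.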
The invention is also directed to an electronic signal in a low bit rate speech coder. The signal comprises a filtered, magnitude expanded and transformed replica r_xy(k) of a speech signal s(n). The replica has a plurality of substantially equal magnitude peaks, with each adjacent pair of peaks separated by a distance corresponding to one pitch cycle of s(n) if s(n) is voiced. The replica is used by the speech coder to derive a speech sound voicing parameter v(m) characterizing s(n) and a speech sound pitch parameter p(m) characterizing s(n). r_xy(k) is derived by cross-correlating a filtered and expanded replica x(n) of s(n) with a template y(n) comprising a portion of x(n) containing a largest magnitude signal peak of x(n).
Brief Description of Drawings
Figure 1 is a signal waveform segment depicting 200 samples of voiced speech sampled at 8 kHz.
Figures 2A and 2B together comprise a flowchart illustrating the basic methodology of the invention.
Figure 3 is a graph depicting the frequency response characteristic of the low pass filter used in the preferred embodiment of the invention.
Figure 4A is a graph depicting an original voiced speech signal waveform segment, before processing in accordance with the present invention.
Figure 4B depicts the signal waveform segment of Figure 4A after low pass filtration and expanding in accordance with the invention.
Figure 4C depicts a normalized cross-correlation sequence obtained by cross-correlating the Figure 4B signal waveform segment with a largest-peak-containing portion thereof in accordance with the invention.
Figures 5A through 5L are flowcharts which collectively embody a method of estimating the speech sound pitch and voicing parameters in accordance with the present invention.
Description
Introduction
An overview of the basic methodology of the invention will first be provided with reference to Figures 2A and 2B. A detailed description of a preferred embodiment of the invention will then be provided with reference to Figures 5A-5L.
Referring to Figure 2A, processing begins at block 300 by initializing various parameters. A speech frame containing speech samples is then retrieved (block 302). The speech frame is preprocessed (block 304) by low pass filtration to remove high frequency signal components and by expanding (i.e. cubing) to enhance the peak portions of the signal. The energy ratio of the preprocessed and original speech frame signals is then determined (block 306), and the maximum magnitude attained by the original speech frame signal is determined (block 308). A broad initial classification based upon the energy ratio and maximum magnitude values is then made (block 310), to determine (block 312) whether the current frame appears to be voiced or unvoiced. If the current frame is classified as unvoiced (i.e. if the block 312 test is answered "yes" ('Y')) then the frame's voicing and pitch value variables are zeroed and various parameters are updated (block 314). A test (block 316) is then performed to detect and reverse a possible change in voicing value (i.e. from voiced to unvoiced, or vice versa) due to a mere transient occurrence such as a single voiced frame occurring in the midst of a series of unvoiced frames. After further updating of parameters, processing of the next frame then proceeds, commencing at point "z", as described above.
If the current frame is classified as voiced (i.e. if the block 312 test is answered "no" ('N')) then the preprocessed signal is inverted (block 320), if necessary, to orient the signal with its largest peak positive-going. This simplifies peak location, aiding in determination of periodicity and thus pitch. At block 322, a portion of the preprocessed signal containing the aforementioned largest peak is extracted to form a template. The template is cross-correlated (block 324) across the entire frame to yield a signal (Figure 4C) which is expected to have peaks of substantially equal magnitude. Processing then continues at point "a" (Figure 2B).
At point "a" (Figure 2B) processing continues (block 326) by detecting peaks within the cross-correlated signal, using three separate peak detection thresholds. The distances between adjacent pairs of peaks detected in block 326 are determined (block 328). These distances are representative of the speech frame's pitch period. A test (block 330) is performed to determine whether the candidate pitch period values determined in block 328 are approximately equal to one another. If the answer is "yes" ('Y'), then the frame's pitch period value is set (block 332) equal to the average of the candidate pitch period values determined in block 328, a pitch value confidence flag is set to indicate a high degree of confidence in the pitch period value so determined, and the frame's voicing value variable is set to reflect the fact that the frame is voiced. At block 334 the pitch value is checked to determine whether it reflects multiple or sub-multiple pitch values, and any such aberrations are removed if detected. Processing then continues at point "b" (Figure 2A) as previously explained.
If the block 330 test is answered "no" ('N') then a further test (block 336) is performed to determine whether some peaks may not have been detected by the aforementioned block 326 processing. If the block 336 test is answered "yes" ('Y') then the peak detection threshold(s) are lowered (or other peak detection criteria are relaxed, as hereinafter explained) and processing continues at point "a" as previously explained. If the block 336 test is answered "no" ('N') then a further test (block 338) is performed to determine whether the frame's signal energy is changing relatively quickly. If the block 338 test is answered "no" ('N') then the frame's voicing and pitch value variables are zeroed (block 340) and processing continues at point "b" (Figure 2A) as previously explained. If the block 338 test is answered "yes" ('Y') then the frame's voicing value variable is set (block 342) to reflect the fact that the frame is voiced; and, the frame's pitch value variable is set to either the value of the pitch variable of the preceding frame or to the maximum allowable pitch value. Processing then continues at block 334, as previously explained.
Detailed Discussion
The following table summarizes the parameters which are used in flowchart Figures 5A-5L and in the following description of the invention:

Symbol | Description | Initial Value
m | frame counter | 0
ε | a small constant | 10^-10
p(j) | estimated pitch for the j-th frame | p(0)=0
v(j) | estimated voicing for the j-th frame (v(j)=1 indicates voiced frame; v(j)=0 indicates unvoiced frame) | v(0)=0
f_sv | special voicing flag (f_sv=1 indicates that a decision that the current frame is voiced was made under special circumstances) | 0
f_sv- | previous value of f_sv | 0
f_sp | pitch value confidence flag (f_sp=1 indicates high confidence in currently assigned pitch value) | 0
f_sp- | previous value of f_sp | 0
f_transient | voicing transient flag (f_transient=1 indicates that voicing changed from voiced to unvoiced or vice versa in the frame for which f_transient=1) | 0
p_old | old pitch value (average value of pitch for an unbroken sequence of voiced frames) | 0
p_v- | previous voiced pitch | 0
p_v-- | previous value of p_v- | 0
p_past | past value of voiced pitch (equal to either p_old or p_v-) | none
L_frame | frame length, in samples | 320
(see Figure imgf000013_0001) | subframe length, in samples | 80
(see Figure imgf000013_0001) | window length for computing energy function | 10
L_t | length of peak template | 50
L_track | length of an unbroken sequence of voiced frames | 0
MAX_UVLEVEL | maximum sample value for unvoiced | 7000
MAX_VLEVEL | maximum sample value for voiced | 32000
MIN_VLEVEL | minimum sample value for voiced | 100
ERATIOMIN_V | energy ratio parameter | 6.0e+06
PEAK_THRESH1 | first normalized peak threshold | 0.85
PEAK_THRESH2 | second normalized peak threshold | 90% of PEAK_THRESH1
PEAK_THRESH3 | low peak threshold | 0.6
DPMAX | maximum allowable pitch variation between successive cycles | 8
DPMAX2 | a lower pitch variation allowable in special cases | 6
MINTRLEN | minimum value L_track must reach before p_old is updated | 3
MAXP | maximum allowable pitch, in samples | 210
MAXP2 | maximum quantized pitch, in samples | 147
MINP | minimum allowable pitch, in samples | 15
BASELEN | parameter used to force minimum frame length to 320 samples | 270
Referring now to Figure 5A, block 10, frame counter m, estimated pitch value p(m) and estimated voicing value v(m) parameters are each initialized to zero. A voicing value v(m)=0 corresponds to unvoiced speech, which has no pitch characteristic. A voicing value v(m)=1 corresponds to voiced speech, for which the estimated pitch value p(m) is meaningful.
At block 12 a test is performed to determine whether a speech frame containing speech samples is available for processing. If the answer is "no" ('N'), processing stops (block 14). If the answer is "yes" ('Y'), frame counter m is incremented by one and the special voicing and pitch value confidence flags f_sv, f_sp are each initialized to zero (block 16). Abnormalities such as croaking by the speaker may cause signal aberrations such as peak-to-peak interval spacings which exceed the peak-to-peak interval spacings which characterize the speaker's normal voiced speech pitch range. The special voicing flag f_sv is set to one, as hereinafter explained, to indicate that a voicing value v(m)=1 (i.e. voiced speech) has resulted from signal conditions other than those which characterize the speaker's normal voiced speech pitch range, thus indicating that the particular voicing value v(m)=1 may be suspect. A pitch value confidence flag f_sp is set to one, as hereinafter explained, to indicate a high degree of confidence in the determination of an associated estimated pitch value p(m). Certain corrections can be made to the estimated pitch value p(m), as hereinafter explained, if f_sp=0, indicating lack of high confidence in p(m); f_sp is not set back to one after such corrections are made.
At block 18 the input speech signal segment s(n) (i.e. a "speech frame", for example, a signal consisting of speech sampled at 8 kHz, as depicted in Figure 4A) is low pass filtered to remove high frequency signal components. Figure 3 depicts the frequency response characteristic of a suitable low pass filter having a cutoff frequency of 500 Hz ("LPF" in block 18). The high frequency components of speech sounds tend to differ considerably from one cycle to the next cycle. Removal of these components makes the signal appear more similar between adjacent cycles, improving the reliability with which each cycle can be identified, and hence improving the reliability with which pitch can be estimated.
The low pass filtered signal is then expanded by cubing it ("(·)^3" in block 18) to enhance (i.e. amplify) the peak portions of the signal, relative to the non-peak signal portions, as seen in Figure 4B. This makes the peaks stand out, making it easier to detect them, thus improving the reliability with which each peak-to-peak signal cycle can be identified, further improving the reliability with which pitch can be estimated. Squaring the low pass filtered signal would adequately enhance the signal peaks, but cubing preserves the negative-going signal portions, and is therefore preferred.
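The block 18 preprocessing could be sketched as below. Note that the low pass filter here is only a stand-in: the actual filter is characterized by Figure 3 (500 Hz cutoff), whereas this sketch assumes a simple moving average purely to keep the example self-contained. The function names and the default width are likewise hypothetical.

```python
from typing import List, Sequence


def moving_average(s: Sequence[float], width: int = 9) -> List[float]:
    """Crude low pass filter (assumption: stands in for the Figure 3 filter)."""
    half = width // 2
    out = []
    for n in range(len(s)):
        lo, hi = max(0, n - half), min(len(s), n + half + 1)
        out.append(sum(s[lo:hi]) / (hi - lo))
    return out


def expand(filtered: Sequence[float]) -> List[float]:
    """Cube each sample: peaks grow relative to non-peaks, and sign is kept."""
    return [v ** 3 for v in filtered]


def preprocess(s: Sequence[float], width: int = 9) -> List[float]:
    """Low pass filter, then magnitude expand, giving x(n)."""
    return expand(moving_average(s, width))


# Cubing keeps the negative-going sample negative; squaring would not.
cubed = expand([-2.0, 0.5, 1.0])
squared = [v ** 2 for v in [-2.0, 0.5, 1.0]]
```

For the three sample values above, the ratio between the largest and smallest magnitudes grows from 4 before expansion to 64 after cubing, which is the peak enhancement the text refers to.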
The low pass filtered, cubed signal is designated x(n), where n is the sample number. As explained in detail below, the invention estimates an average value of pitch at a particular time instant by defining a "window" (hereafter "frame") centred at that time instant. All signal peaks of complete signal cycles included within the frame are examined so as to identify those cycles. The interval length of each such cycle is determined. The average interval length for all complete signal cycles included within the frame is then determined. The average interval length value so determined is the average pitch estimate.
The energy ratio er of the two signals s(n) and x(n) is then determined as indicated in block 20. The value ε is arbitrarily small (i.e. ε = 10^-10) to prevent division by zero in determination of er, which could otherwise occur, since s(n)=0 if the speaker is silent and no speech sounds are produced. In such case er is set to zero. The maximum magnitude s_max attained by the speech signal s(n) throughout the frame is then determined at block 22. The absolute value of s(n) is used to make this determination, because s(n) may attain its maximum magnitude while negative.
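The exact block 20 formula appears only in the flowchart, so the following is a sketch under an assumption: er is taken to be the energy of x(n) divided by the energy of s(n) plus ε. That reading is at least consistent with er being zero for a silent frame and with the large ERATIOMIN_V value in the table, but it is not confirmed by the text. The function names are hypothetical.

```python
from typing import Sequence

EPSILON = 1e-10  # the small constant from the parameter table


def energy_ratio(s: Sequence[float], x: Sequence[float]) -> float:
    """Assumed form of er: energy of x(n) over energy of s(n) plus epsilon."""
    energy_x = sum(v * v for v in x)
    energy_s = sum(v * v for v in s)
    return energy_x / (energy_s + EPSILON)


def max_magnitude(s: Sequence[float]) -> float:
    """s_max: largest absolute sample value in the frame."""
    return max(abs(v) for v in s)


silent = energy_ratio([0.0, 0.0], [0.0, 0.0])  # no division by zero
```

For a silent frame both energies are zero, so the ε term keeps the division defined and the ratio comes out as zero.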
Higher er values are generally characteristic of voiced speech, lower er values are generally characteristic of unvoiced speech, higher s_max values are generally characteristic of voiced speech, and lower s_max values are generally characteristic of unvoiced speech. Accordingly, at block 24, a broad initial test is performed to determine whether the current frame appears to be voiced; or, unvoiced (and thus has no pitch). Specifically, if s_max is less than a predefined constant value MIN_VLEVEL characteristic of voiced speech, then the current frame is tentatively characterized as unvoiced by setting the voicing value v(m)=0 (block 26). Since unvoiced speech has no pitch, the estimated pitch value p(m) is also zeroed (block 26). The current frame is also tentatively characterized (block 24) as unvoiced if s_max is less than a predefined constant value MAX_UVLEVEL characteristic of unvoiced speech and er is less than a predefined constant value ERATIOMIN_V characteristic of unvoiced speech. Otherwise, the current frame is tentatively characterized as voiced, and processing continues at point "B" (Figure 5B). If the current frame is tentatively characterized as unvoiced, as explained above, then processing continues at point "C" (Figure 5I).
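The block 24 test described above amounts to a two-part condition, which might be written as follows. The function name is hypothetical; the default constants are the initial values listed in the parameter table.

```python
def tentatively_unvoiced(
    s_max: float,
    er: float,
    min_vlevel: float = 100,
    max_uvlevel: float = 7000,
    eratiomin_v: float = 6.0e6,
) -> bool:
    """True if the frame should be tentatively characterized as unvoiced."""
    too_quiet = s_max < min_vlevel
    weak_and_low_ratio = s_max < max_uvlevel and er < eratiomin_v
    return too_quiet or weak_and_low_ratio


# A moderately loud frame with a low energy ratio is treated as unvoiced,
# while the same level with a high energy ratio is treated as voiced.
quiet_ratio = tentatively_unvoiced(s_max=5000, er=1.0e6)
strong_ratio = tentatively_unvoiced(s_max=5000, er=1.0e7)
```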
Assume that processing at block 24 results in the current frame being tentatively characterized as voiced. Processing accordingly continues at point "B" (block 28, Figure 5B) by testing to determine whether the value of x(n) with the largest magnitude is positive (greater than zero). If the answer is "no" ('N'), then x(n) is inverted in block 30. The object is to orient x(n) so that its largest peak is positive-going. This simplifies location of such peak, which aids in determining periodicity and thus pitch. At block 32, a portion y(n) of the frame containing the aforementioned largest peak, plus a few samples on either side of that peak, is extracted. y(n) serves as a template in the cross-correlation performed in block 34. Specifically, y(n) is cross-correlated across the entire frame to yield r_xy(k) which contains a plurality of substantially equal magnitude peaks, a representative example of which is depicted in Figure 4C together with three predefined peak threshold values PEAK_THRESH1, PEAK_THRESH2, and PEAK_THRESH3 which are employed as hereinafter explained.
At block 36, the value n_p1 is assigned a value equal to the number of signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1. Multiple peak-to-peak intervals with peaks exceeding PEAK_THRESH1 facilitate reliable determination of signal period, and hence pitch. If similar interval widths can be derived for a suitable number of adjacent intervals then the average width of such intervals can be accepted as the pitch estimate with reasonably high confidence in the accuracy of such estimate.
Six of the r_xy(k) peaks depicted in Figure 4C exceed PEAK_THRESH1, so n_p1=6 for the example shown in Figure 4C. i_p1 is assigned (block 36) a vector value equal to the positions of those signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1. The six aforementioned peaks depicted in Figure 4C occur at k=9, k=56, k=104, k=150, k=197, and k=244 respectively, so i_p1=(9, 56, 104, 150, 197, 244) for the example shown in Figure 4C.
At block 38, a test is performed to determine whether n_p1<2. If n_p1<2, then r_xy(k) does not contain at least two peaks having a magnitude exceeding PEAK_THRESH1, making it impossible to determine any peak-to-peak interval width for r_xy(k). In such case, the current frame is characterized as unvoiced (block 40) by zeroing the voicing value v(m) and the pitch value p(m), and processing then continues at point "D" (Figure 5G), as hereinafter explained. If n_p1≥2, then r_xy(k) contains at least two peaks having a magnitude exceeding PEAK_THRESH1, facilitating determination of peak-to-peak interval width(s) for r_xy(k), as indicated in block 42. In particular, the peak-to-peak interval width(s) p_1(k) are determined for all adjacent signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1. Then, the variations Δp_1(k) between those interval width(s) are determined for all such adjacent signal peaks in r_xy(k). For the example shown in Figure 4C, p_1=(47, 48, 46, 47, 47) and Δp_1=(1, -2, 1, 0) respectively.
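The block 42 arithmetic can be checked against the Figure 4C example with a few lines of code. The function names are hypothetical; the peak positions are the ones quoted in the text.

```python
from typing import List, Sequence


def interval_widths(peak_positions: Sequence[int]) -> List[int]:
    """Peak-to-peak interval widths between adjacent peaks."""
    return [b - a for a, b in zip(peak_positions, peak_positions[1:])]


def width_variations(widths: Sequence[int]) -> List[int]:
    """Differences between successive interval widths."""
    return [b - a for a, b in zip(widths, widths[1:])]


i_p1 = [9, 56, 104, 150, 197, 244]   # peak positions from Figure 4C
p_1 = interval_widths(i_p1)          # [47, 48, 46, 47, 47]
delta_p_1 = width_variations(p_1)    # [1, -2, 1, 0]
```

Six peaks give five interval widths and four variations, which matches the vectors stated for the Figure 4C example.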
At block 44, a test is performed to determine whether n_p1=2. If n_p1=2, then r_xy(k) contains exactly two peaks having a magnitude exceeding PEAK_THRESH1, meaning that only one peak-to-peak interval width can be determined relative to such peaks, in which case processing continues at point "E" (Figure 5C), as hereinafter explained. If n_p1≠2, then r_xy(k) contains more than two peaks having a magnitude exceeding PEAK_THRESH1, meaning that more than one peak-to-peak interval width can be determined relative to such peaks, in which case processing continues at point "F" (Figure 5F).
Assume that processing at block 44 reveals that n_p1≠2. Processing accordingly continues at block 46 (Figure 5F), by performing a test to determine whether the maximum variation in interval width Δp_1 is less than or equal to the maximum allowable pitch variation between successive cycles (i.e. DPMAX). If the answer is "no" ('N'), then processing continues at block 156 as hereinafter explained. If the answer is "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 48) having a pitch value p(m) equal to the average peak-to-peak interval width p_1 for all adjacent signal peaks in r_xy(k) having a magnitude exceeding PEAK_THRESH1.
As indicated in the above table, DPMAX=8 initially. However in practice, the pitch variation between successive cycles is rarely as high as 8. So, confidence in the reliability of the pitch value assigned in block 48 is not high. A more stringent test is accordingly performed (block 50) to determine whether the maximum variation in interval width Δp_1 is less than or equal to one-half the maximum allowable pitch variation between successive cycles (i.e. DPMAX/2). If the answer is "no" ('N'), then processing continues at point "D" (Figure 5G), as hereinafter explained. However, if the answer is "yes" ('Y'), then the pitch value confidence flag f_sp is set (block 52) to indicate a high degree of confidence in the reliability of the pitch value assigned in block 48. Processing then continues at point "D" (Figure 5G).
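Blocks 46 to 52 could be sketched as a single function, shown below. Two assumptions are made: "maximum variation" is read as the largest absolute difference between successive interval widths, and a return value of None stands for "continue at block 156", which this sketch does not model. The function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def pitch_from_widths(
    widths: Sequence[float], dpmax: float = 8
) -> Optional[Tuple[float, bool]]:
    """Return (pitch, high_confidence), or None if variation exceeds DPMAX.

    Expects at least two widths, i.e. more than two detected peaks.
    """
    variations = [abs(b - a) for a, b in zip(widths, widths[1:])]
    max_variation = max(variations)
    if max_variation > dpmax:
        return None  # block 46 answered "no": handled elsewhere (block 156)
    pitch = sum(widths) / len(widths)          # block 48
    high_confidence = max_variation <= dpmax / 2  # blocks 50 and 52
    return pitch, high_confidence


figure_4c = pitch_from_widths([47, 48, 46, 47, 47])
```

For the Figure 4C widths the largest variation is 2, which passes both the DPMAX and DPMAX/2 tests, so the average width of 47 samples is accepted with high confidence.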
Pitch values which characterize normal human speech can vary widely. However, the present invention is directed to low bit rate speech coders, which do not require accurate determination of particularly high or low pitch values, since such values do not significantly affect the speech coding quality of such coders. Moreover, it is relatively difficult to accurately determine particularly high or low pitch values. Accordingly, when processing continues at point "D" (Figure 5G), a test is performed (block 54) to determine whether the pitch value p(m) is particularly high or low (i.e. exceeds the predefined constant MAXP or is exceeded by the predefined constant MINP). If the answer is "yes" ('Y'), then such particularly high or low values are ignored by characterizing the current frame as unvoiced (block 56), by zeroing the voicing value v(m) and the pitch value p(m). If the answer is "no" ('N'), then a further test (block 58) is performed to determine whether the current frame has already been characterized as unvoiced. This is necessary because, as an examination of Figures 5A-5L will reveal, point "D" can be reached by following a number of different paths along which the current frame may have been characterized as unvoiced prior to reaching point "D".
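The block 54 range check, together with a conversion of the limits to frequency, might look like this. The function names are hypothetical, and the 8 kHz sampling rate is the example rate mentioned earlier in the description rather than a fixed requirement.

```python
from typing import Tuple

SAMPLE_RATE_HZ = 8000  # example sampling rate from the description


def enforce_pitch_range(
    v: int, p: float, minp: float = 15, maxp: float = 210
) -> Tuple[int, float]:
    """Zero the voicing and pitch values if p lies outside [MINP, MAXP]."""
    if p > maxp or p < minp:
        return 0, 0.0
    return v, p


def pitch_in_hz(p_samples: float, sample_rate: float = SAMPLE_RATE_HZ) -> float:
    """Convert a pitch period in samples to a fundamental frequency in Hz."""
    return sample_rate / p_samples


highest_hz = pitch_in_hz(15)   # MINP corresponds to roughly 533 Hz
lowest_hz = pitch_in_hz(210)   # MAXP corresponds to roughly 38 Hz
```

At 8 kHz the allowable pitch periods of 15 to 210 samples therefore span fundamental frequencies from roughly 38 Hz to roughly 533 Hz.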
If the current frame is characterized as unvoiced, then processing continues at block 222, as hereinafter explained. If the current frame is not characterized as unvoiced, then processing continues at point "C" (Figure 5I). At point "C" a test is performed (block 60) to verify that the current frame is characterized as voiced (i.e. v(m)=1) and that the special voicing flag f_sv is cleared (i.e. f_sv=0). If the answer is "no" ('N'), then a further test (block 62) is performed to determine whether the current frame is characterized as unvoiced (like point "D", point "C" can be reached by following a number of different paths along which the current frame may have been characterized as unvoiced prior to reaching point "C"). If the answer to this further test is "no" ('N'), then processing continues at point "L" (Figure 5K), as hereinafter explained. If the answer to this further test is "yes" ('Y'), then the current values of the variables L_track and p_old are saved (block 64) in the variables L_track- and p_old- respectively, and the values of L_track and p_old are re-initialized to zero. This facilitates error correction, as hereinafter explained, in the event that subsequent processing reveals an error in a previously assigned pitch value. Processing then continues at point "L" (Figure 5K), as hereinafter explained. If the block 60 test answer is "yes" ('Y'), then a further test (block 66) is performed to determine whether the variable p_old (which represents the average pitch value for an unbroken sequence of voiced frames) exceeds its initial value of zero. If the answer is "yes" ('Y'), then the current value of p_old is saved (block 68) in the variable p_past. If the answer is "no" ('N'), then the current value of the variable p_v- is saved (block 70) in the variable p_past.
In speech produced by some speakers, phenomena termed "pitch doubling", "pitch halving", "pitch thirding" or "pitch quartering" may be observed. For example, pitch doubling is characterized by a reduction in the magnitude of every other peak in the speaker's speech sounds. This can result in incorrect determination of the pitch of such speech sounds as double the correct pitch value. In particular, every other peak may be excluded from the peaks used to determine peak-to-peak interval length (and hence pitch) if the magnitude of every other peak does not exceed the threshold value used to identify the peaks. Similarly, pitch halving can result in incorrect determination of pitch as one-half the correct value, pitch thirding can result in incorrect determination of pitch as one-third the correct value, and pitch quartering can result in incorrect determination of pitch as one-quarter the correct value.
To address the problems of pitch doubling, etc., a test (block 72) is performed to determine whether the pitch value p(m) is less than 80 and whether the pitch value confidence flag fp is cleared. If the answer is "no" ('N') in either case, then processing continues at point "N" (Figure 5J), as hereinafter explained. If the answer is "yes" ('Y') in both cases, then a pitch quartering test (block 74) is performed to determine whether the absolute value of the past value of voiced pitch ppast less four times the pitch value p(m) is less than or equal to the value of the variable DPMAX2. If the answer is "yes" ('Y'), then it is concluded that the current frame is characterized by the pitch quartering phenomenon and the phenomenon's effect is removed by quadrupling the pitch value p(m) (block 76). Processing then continues at point "N" (Figure 5J), as hereinafter explained.
If the pitch quartering test (block 74) answer is "no" ('N'), then a pitch thirding test (block 78) is performed to determine whether the absolute value of the past value of voiced pitch ppast less three times the pitch value p(m) is less than or equal to the value of the variable DPMAX2. If the answer is "yes" ('Y'), then it is concluded that the current frame is characterized by the pitch thirding phenomenon and the phenomenon's effect is removed by trebling the pitch value p(m) (block 80). Processing then continues at point "N" (Figure 5J), as hereinafter explained.
If the pitch thirding test (block 78) answer is "no" ('N'), processing continues at point "M" (Figure 5J) with a pitch halving test (block 82) to determine whether the absolute value of the past value of voiced pitch ppast less twice the pitch value p(m) is less than or equal to the value of the variable DPMAX2. If the answer is "yes" ('Y'), then it is concluded that the current frame is characterized by the pitch halving phenomenon and the phenomenon's effect is removed by doubling the pitch value p(m) (block 84). If the pitch halving test (block 82) answer is "no" ('N'), or after doubling of the pitch value p(m) (block 84), processing continues at point "N", with a pitch doubling test (block 86) to determine whether the pitch value exceeds 120, and whether the absolute value of the past value of voiced pitch ppast less half the pitch value p(m) is less than or equal to the value of the variable DPMAX2, and whether the pitch value confidence flag fp is not set. If the answer is "yes" ('Y') in all three cases, then it is concluded that the current frame is characterized by the pitch doubling phenomenon and the phenomenon's effect is removed by halving the pitch value p(m) (block 88).
If any of the pitch doubling test determinations (block 86) are answered "no" ('N'), or after halving the pitch value p(m) (block 88), a further test (block 90) is performed to determine whether the value of Ltrack (the current length of the unbroken sequence of voiced frames) is greater than or equal to a predefined constant MINTRLEN. This constant fixes at 3 the number of voiced frames which must occur in unbroken sequence before the variable pold is updated. If the block 90 test answer is "yes" ('Y'), then pold is updated (block 92) by assigning pold a value equal to the average of the pitch values determined for the current frame and the immediately preceding two frames.
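The pitch-multiple tests of blocks 72 to 88 reduce to a short cascade of comparisons. The following is a minimal sketch only, assuming pitch values are held as plain numbers; the function and argument names are illustrative and do not appear in the specification, and the value of DPMAX2 must be supplied by the caller because it is not stated in this passage.

```python
def correct_pitch_multiples(p, p_past, f_p, dpmax2):
    """Undo suspected pitch quartering, thirding, halving and doubling.

    p       -- current pitch value p(m)
    p_past  -- past value of voiced pitch (ppast)
    f_p     -- pitch value confidence flag (True when set)
    dpmax2  -- tolerance DPMAX2 (hypothetical argument; value not given here)
    """
    if p < 80 and not f_p:                          # block 72
        # Quartering, thirding and halving tests, tried in that order.
        for factor in (4, 3, 2):                    # blocks 74, 78, 82
            if abs(p_past - factor * p) <= dpmax2:
                p = factor * p                      # blocks 76, 80, 84
                break
    # Point "N": every path above ends at the pitch doubling test.
    if p > 120 and abs(p_past - p / 2) <= dpmax2 and not f_p:   # block 86
        p = p / 2                                   # block 88
    return p
```

For instance, with a past voiced pitch of 100 and a tolerance of 6, a measured value of 50 is doubled back to 100, while a measured value of 200 is halved to 100 unless the confidence flag is set.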
If the block 90 test answer is "no" ('N'), or after updating pold (block 92), the current value of the variable pv_ is stored in the variable pv__ (block 94). A test (block 96) is then performed to determine whether the pitch value p(m) is less than or equal to the value of the predefined maximum quantized pitch constant MAXP2 (which is initialized at 147). Pitch values exceeding 147 are rare, so pitch values determined to exceed 147 are of questionable reliability. This is recognized by bypassing block 98, in which the pitch value p(m) is stored in the variable pv_, if the block 96 test reveals a pitch value exceeding 147. Processing then continues at point "L" (Figure 5K).
At point "L", a test is performed (block 100) to determine whether the voicing transient flag ftransient has been set. The voicing transient flag is set to indicate that, during processing of the current frame, a decision has been made to change the frame's characterization from voiced to unvoiced, or vice-versa. If the voicing transient flag is set (i.e. if ftransient=1) then a further test is made (block 102) to determine whether the current frame's voicing characterization is not the same as that of the immediately preceding frame. If the answer is "yes" ('Y'), then processing continues at block 258, as hereinafter explained. If the answer is "no" ('N'), then the voicing transient flag is cleared (block 104). If the block 100 test reveals that the voicing transient flag is not set (i.e. if ftransient=0) then a further test is made (block 106) to determine whether the current frame's voicing characterization is not the same as that of the immediately preceding frame. If the answer is "yes" ('Y'), then the voicing transient flag is set (block 108). If the answer is "no" ('N'), or after setting the voicing transient flag, processing continues at point "O" (Figure 5L), as hereinafter explained.
The objective of the above-described processing in blocks 100-108 is to set or clear the voicing transient flag to facilitate correction of v(m) if a transient occurrence is detected, such as a single voiced frame occurring in the midst of a series of unvoiced frames. Whenever a voicing transition occurs (i.e. from voiced to unvoiced, or vice versa), the voicing transient flag is set to reflect such change and denote a possible transient occurrence. If the voicing transient flag is already set when processing reaches block 100, and if the current and immediately preceding frames have the same voicing classification (i.e. both voiced, or both unvoiced), then it is concluded that a valid (i.e. non-transient) voicing transition has occurred; hence, the voicing transient flag is cleared (block 104). But, if the voicing transient flag is not set when processing reaches block 100, and if the current and immediately preceding frames have different voicing classifications, then it is concluded that a new and possibly transient voicing transition has occurred; hence the voicing transient flag is set (block 108).
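The flag handling of blocks 100 to 108 amounts to a small two-state machine. The sketch below is illustrative only: the function name and the outcome labels are assumptions introduced here, and the repair performed from block 258 onward is left to the caller.

```python
def update_transient_flag(f_transient, v_cur, v_prev):
    """Return (new_flag, outcome) for the tests of blocks 100 to 108.

    outcome is one of (labels are hypothetical, for illustration):
      "repair"    -- flag was set and voicing changed again: go to block 258
      "confirmed" -- flag was set and voicing held: valid transition
      "new"       -- flag was clear and voicing changed: possible transient
      "steady"    -- flag was clear and voicing unchanged
    """
    changed = v_cur != v_prev
    if f_transient:                     # block 100: flag already set
        if changed:                     # block 102
            return True, "repair"       # flag is cleared later, in block 260 or 266
        return False, "confirmed"       # block 104
    if changed:                         # block 106
        return True, "new"              # block 108
    return False, "steady"
```

Returning an outcome label alongside the flag keeps the branch to block 258 explicit without embedding the repair logic in the same function.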
If a valid (non-transient) voicing transition is detected as aforesaid, a test is performed (block 110) to determine whether the current frame is characterized as unvoiced. If the answer is "no" ('N'), then processing continues at point "O" (Figure 5L), as hereinafter explained. If the answer is "yes" ('Y'), then the variables pold_ and Ltrack_ are re-initialized to zero (block 112). Processing then continues at point "O" (Figure 5L), as will now be explained.
At point "O", a test is performed (block 114) to verify that the current frame is characterized as voiced (i.e. v(m)=1) and that the special voicing flag fsv is cleared (i.e. fsv=0). If the answer is "yes" ('Y'), then a further test is performed (block 116) to determine whether the average pitch value pold determined in respect of the most recent unbroken sequence of voiced frames exceeds zero, and the absolute value of the pitch value p(m) less pold exceeds twice the value of the variable DPMAX, and the pitch value confidence flag fp is cleared. If the answer is "yes" ('Y') in all three cases, then it is concluded that an error has occurred and that the current frame's pitch value p(m) is incorrect. A correction is applied by resetting p(m) to a "safe" value, namely the previous frame's pitch value (block 118). Thus, successive pitch values contained within an unbroken sequence of valid pitch values (indicated by pold > 0) are not permitted to differ by more than twice DPMAX unless the pitch confidence flag fp is set to indicate high confidence in the pitch value.
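The correction of blocks 116 and 118 can be expressed as a single guarded assignment. This is a sketch under the assumption that the frame has already passed the block 114 test; the function and argument names are illustrative and are not taken from the specification.

```python
def limit_pitch_jump(p_cur, p_prev, p_old, dpmax, f_p):
    """Blocks 116 and 118: reject an implausible jump in pitch.

    p_cur  -- current frame's pitch value p(m)
    p_prev -- previous frame's pitch value p(m-1), used as the "safe" value
    p_old  -- average pitch of the current unbroken voiced sequence (pold)
    dpmax  -- current value of DPMAX
    f_p    -- pitch value confidence flag (True when set)
    """
    # All three conditions of block 116 must hold for the correction to apply.
    if p_old > 0 and abs(p_cur - p_old) > 2 * dpmax and not f_p:
        return p_prev                   # block 118
    return p_cur
```

With DPMAX at 8, a jump from an average of 60 to a measured 100 is replaced by the previous frame's value unless the confidence flag is set.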
If any of the block 116 determinations are answered "no" ('N'), or after setting the current frame's pitch value p(m) in block 118, a further test is made (block 120) to determine whether the absolute value of the current frame's pitch value p(m) less that of the immediately preceding frame p(m-1) is less than the value of the variable DPMAX. If the answer is "no" ('N'), then the pitch variation between the two successive frames exceeds the DPMAX threshold, breaking the sequence of voiced frames. This is reflected by reinitializing the variables Ltrack and pold to zero (block 122). If the answer is "yes" ('Y'), then the sequence of voiced frames continues unbroken, which is reflected by incrementing the variable Ltrack (block 124).
The special voicing flag fsv is set to one, as hereinafter explained, to indicate that a voiced speech determination for the current frame (i.e. v(m)=1) has resulted from signal conditions other than those which characterize the speaker's normal voiced speech pitch range, thus indicating that the particular voicing value v(m)=1 may be suspect. If the special voicing flag is set, and if the next frame exhibits normal voicing characteristics, then it is highly likely that the next frame's pitch value will differ significantly from that of the frame for which the special voicing flag was set. This potential difference is accommodated by relaxing the value of the variable DPMAX, which determines the maximum allowable pitch variation between successive cycles. In particular, after the variable Ltrack has been updated via either of blocks 122 or 124, a further test is made (block 126) to determine whether the special voicing flag fsv is set (i.e. fsv=1). If the answer is "yes" ('Y'), then the value of the variable DPMAX is reset to 14 (block 128).
If the answer is "no" ('N'), then the value of the variable DPMAX is reset to either 8, or 10% of the previously determined voiced pitch value pv_ rounded to the nearest integer, whichever is greater (block 130). Finally, at block 132, the variables Lt, Lframe, fsv_ and fp_ are updated. Lt is set to either 50, or twice the integer part of 30% of the previously determined voiced pitch value pv_, whichever is greater. Lframe is then set to the value of the parameter BASELEN plus the updated value of Lt. The current value of the special voicing flag fsv is stored in fsv_, and the current value of the pitch value confidence flag fp is stored in fp_. Processing of the next frame then proceeds, commencing at point "P" (Figure 5A), as described above.
Returning now to Figure 5B, assume that processing at block 44 reveals that np1=2. Processing accordingly continues at point "E" (Figure 5C), by assigning the parameter np2 a value equal to the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH2 (block 134). If only two peaks exceed PEAK_THRESH1 (i.e. np1=2) then only one peak-to-peak interval width can be determined relative to such peaks. This introduces a reliability problem, since there are no other intervals to compare with such one interval. Although the above-described process of computing ry(k) is expected to produce peaks of approximately equal height, some peaks may be slightly shorter than PEAK_THRESH1. This is tested by defining a second threshold PEAK_THRESH2, which is set at 90% of PEAK_THRESH1, and determining the number of peaks having a magnitude exceeding PEAK_THRESH2.
All six of the peaks shown in Figure 4C exceed PEAK_THRESH2, so np2=6 for the example in Figure 4C. ip2 is assigned (block 134) a vector value equal to the positions of those signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH2. The six aforementioned peaks occur at k=9, 56, 104, 150, 197 and 244 respectively, so ip2=(9, 56, 104, 150, 197, 244) for the example shown in Figure 4C. The peak-to-peak interval widths p2(k) are determined (block 136) for all adjacent peaks in ry(k) having a magnitude exceeding PEAK_THRESH2. Then the variations Δp2(l) between those interval widths are determined for all such adjacent signal peaks in ry(k). For the example shown in Figure 4C, p2=(47, 48, 46, 47, 47) and Δp2=(1, -2, 1, 0) respectively.
Following determination of p2(k) and Δp2(l) as aforesaid, a test is performed (block 138) to determine whether np2>2. If the answer is "no" ('N'), then ry(k) does not contain more than two peaks having a magnitude exceeding PEAK_THRESH2, in which case processing continues at point "G" (Figure 5D), as hereinafter explained. If the answer is "yes" ('Y'), then ry(k) contains more than two peaks having a magnitude exceeding PEAK_THRESH2. In such case, a further test is performed (block 140) to determine whether the maximum variation in interval width Δp2 is less than or equal to the maximum allowable pitch variation between successive cycles (i.e. DPMAX). If the answer is "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 142) having a pitch value p(m) equal to the average peak-to-peak interval width p2 for all adjacent signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH2.
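The interval widths, their variations and the block 140 regularity test can be reproduced for the Figure 4C example in a few lines. This is a minimal sketch; the function names are illustrative, and "maximum variation" is assumed here to mean the largest absolute difference between successive interval widths.

```python
def interval_stats(peak_positions):
    """Return (widths, variations) for a sorted list of peak positions."""
    # Width of each peak-to-peak interval (block 136).
    widths = [b - a for a, b in zip(peak_positions, peak_positions[1:])]
    # Change in width from one interval to the next.
    variations = [b - a for a, b in zip(widths, widths[1:])]
    return widths, variations


def mean_pitch_if_regular(peak_positions, dpmax):
    """Blocks 138 to 142: average interval width, or None if not applicable."""
    if len(peak_positions) <= 2:                    # block 138 answered "no"
        return None
    widths, variations = interval_stats(peak_positions)
    if max(abs(d) for d in variations) <= dpmax:    # block 140
        return sum(widths) / len(widths)            # block 142
    return None                                     # left to blocks 148 onward
```

Applied to the positions (9, 56, 104, 150, 197, 244) with DPMAX at 8, the widths are (47, 48, 46, 47, 47), the variations are (1, -2, 1, 0), and the resulting pitch value is 47.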
As previously explained, in practice, the pitch variation between successive cycles is rarely as high as the initialized value of DPMAX = 8. So, confidence in the reliability of the pitch value assigned in block 142 is not high. A more stringent test is accordingly performed (block 144) to determine whether the maximum variation in interval width Δp2 is less than or equal to one-half the maximum allowable pitch variation between successive cycles (i.e. DPMAX/2). If the answer is "no" ('N'), then processing continues at point "D" (Figure 5G), as previously explained. However, if the answer is "yes" ('Y'), then the pitch value confidence flag fp is set (block 146) to indicate a high degree of confidence in the reliability of the pitch value assigned in block 142. Processing then continues at point "D" (Figure 5G), as previously explained.
If the test performed at block 140 determines that the maximum variation in interval width Δp2 is not less than or equal to the maximum allowable pitch variation between successive cycles (i.e. DPMAX), then the maximum and minimum interval width values max(p2), min(p2) are saved as p2max, p2min respectively (block 148). A test is then performed (block 150) to determine whether any sub-multiple of the maximum interval width value, less the minimum interval width value, is less than or equal to DPMAX2. At this point, it is known that more than two peaks have been detected, as revealed by the outcome of the block 138 test. This means that more than one peak-to-peak interval has been detected for the current frame. There are only two possibilities if any one of the detected intervals is a pitch period: (i) all of the detected intervals are of approximately identical width (i.e. the block 140 test outcome is "Yes"), in which case the width of each detected interval is a pitch period; or, (ii) some peaks remain undetected because they do not exceed PEAK_THRESH2, in which case some of the interval widths are equal to multiples of actual pitch periods within some small variation. The block 150 test detects the latter possibility.
If the block 150 test answer is "no" ('N'), then the current frame is characterized as unvoiced (block 152) by zeroing the voicing value v(m) and the pitch value p(m). If the answer is "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 154) having a pitch value p(m) equal to the weighted average of p2max, p2min. Since p2max is found to be approximately equal to a whole number of pitch periods and p2min is equal to one pitch period, the average pitch period over the frame is found by taking a weighted average of the two. Processing then continues at point "D" (Figure 5G), as previously explained.
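One plausible reading of the sub-multiple test of block 150 and the weighted average of block 154 is sketched below. Several details are assumptions rather than statements of the specification: the search over divisors 2 to 4 (the argument max_multiple is hypothetical), the use of an absolute difference, and the particular weighting, which divides the combined width of the two intervals by the number of pitch periods they are taken to span.

```python
def submultiple_pitch(p_max, p_min, dpmax2, max_multiple=4):
    """Return an average pitch period, or None if no sub-multiple matches.

    p_max, p_min -- largest and smallest peak-to-peak interval widths
    dpmax2       -- tolerance DPMAX2 (value not given in this passage)
    max_multiple -- largest divisor tried (assumption, not from the source)
    """
    for j in range(2, max_multiple + 1):
        # Does the longest interval look like j pitch periods?
        if abs(p_max / j - p_min) <= dpmax2:
            # Assumed weighting: j periods in p_max plus one in p_min.
            return (p_max + p_min) / (j + 1)
    return None
```

Under these assumptions, intervals of 94 and 47 samples are read as three pitch periods spanning 141 samples, giving an average period of 47.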
Returning now to Figure 5F, assume that processing at block 46 reveals that the maximum variation in interval width Δp1 is not less than or equal to the maximum allowable pitch variation between successive cycles (i.e. DPMAX). Processing accordingly continues at block 156 by assigning the parameter np3 a value equal to the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3. Six of the ry(k) peaks depicted in Figure 4C exceed PEAK_THRESH3, so np3=6 for the example shown in Figure 4C. ip3 is assigned (block 156) a vector value equal to the positions of those signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3. The six aforementioned peaks depicted in Figure 4C occur at k=9, k=56, k=104, k=150, k=197 and k=244 respectively, so ip3=(9, 56, 104, 150, 197, 244) for the example shown in Figure 4C. The peak-to-peak interval width(s) p3(k) are determined (block 158) for all adjacent signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3. Then, the variations Δp3(l) between those interval width(s) are determined for all such adjacent signal peaks in ry(k). For the example shown in Figure 4C, p3=(47, 48, 46, 47, 47) and Δp3=(1, -2, 1, 0) respectively.
Following determination of p3(k) and Δp3(l) as aforesaid, a test is performed (block 160) to determine whether the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH1 is the same as the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3 (i.e. np1=np3, as is true for the case depicted in Figure 4C). If the answer is "yes" ('Y'), then a further test is performed (block 162) to determine whether the positions of the peaks in ry(k) whose magnitude exceeds PEAK_THRESH1 respectively coincide with the positions of the peaks in ry(k) whose magnitude exceeds PEAK_THRESH3 (as is true for the case depicted in Figure 4C). If this further test is answered "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 164) having a pitch value p(m) equal to the last pitch period detected within the current frame using either one of the PEAK_THRESH1 or PEAK_THRESH3 thresholds. The tests in blocks 160, 162 ensure that the same peaks are detected in both cases. For the example shown in Figure 4C, p1 = p3 = (47, 48, 46, 47, 47), and np1 = np3 = 6, so the pitch value for the current frame, p(m), is set to the last value p1(5) = 47. The pitch value confidence flag fp is then set (block 164) to indicate a high degree of confidence in the reliability of the pitch value assigned in block 164. Processing then continues at point "D" (Figure 5G), as previously explained.
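The agreement test of blocks 160 to 164 compares the peak sets found at the two thresholds and, when they match, takes the last interval as the pitch. The sketch below is illustrative only; the function name and the tuple it returns are assumptions, and peak positions are taken to be sorted lists.

```python
def agree_across_thresholds(peaks_t1, peaks_t3):
    """Return (v, p, f_p) when both thresholds find the same peaks, else None.

    peaks_t1 -- positions of peaks exceeding PEAK_THRESH1
    peaks_t3 -- positions of peaks exceeding PEAK_THRESH3
    """
    same_count = len(peaks_t1) == len(peaks_t3)             # block 160
    if same_count and list(peaks_t1) == list(peaks_t3):     # block 162
        if len(peaks_t1) < 2:
            return None          # no interval can be formed (guard added here)
        last_interval = peaks_t1[-1] - peaks_t1[-2]
        return 1, last_interval, True                       # block 164
    return None                  # handled from block 166 onward
```

For the Figure 4C positions (9, 56, 104, 150, 197, 244) supplied for both thresholds, the frame is voiced with a pitch value of 47 and the confidence flag set.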
If the test performed in block 160 is answered "no" ('N'), or if the test performed in block 162 reveals that the positions of the peaks in ry(k) whose magnitude exceeds PEAK_THRESH1 do not coincide with the positions of the peaks in ry(k) whose magnitude exceeds PEAK_THRESH3, then the maximum and minimum interval width values max(p1), min(p1) are saved as p1max, p1min respectively (block 166). A "no" answer to the block 160 or 162 tests implies that some peaks detected using PEAK_THRESH3 were not detected using PEAK_THRESH1. In such case, the largest interval detected using PEAK_THRESH1 may comprise multiple pitch periods. A test is accordingly performed (block 168) to determine whether any sub-multiple of the maximum interval width value, less the minimum interval width value, is less than or equal to DPMAX2. If the answer is "no" ('N'), then the largest interval detected using PEAK_THRESH1 does not consist of multiple pitch periods, and the frame is characterized as unvoiced (block 170) by zeroing the voicing value v(m) and the pitch value p(m). If the answer is "yes" ('Y') then the largest interval detected using PEAK_THRESH1 most probably does consist of multiple pitch periods, and the frame is characterized as voiced with pitch value p(m) equal to the weighted average of the largest and the smallest intervals, p1max and p1min. Processing then continues at point "D" (Figure 5G), as previously explained.
Returning now to Figure 5C, assume that processing at block 138 reveals that ry(k) does not contain more than two peaks having a magnitude exceeding PEAK_THRESH2 (i.e. np2 is not greater than 2). Processing accordingly continues at point "G" (Figure 5D) with a further test (block 174) to determine whether about one pitch cycle is expected. If the answer is "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 176) having a pitch value p(m) determined by the peak-to-peak interval width between the only two adjacent signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH2. A test is then performed (block 178) to determine whether the pitch value confidence flag was set in respect of the previously processed frame (i.e. fp_=1), reflecting high confidence in the pitch value determined in respect of the previously processed frame. If the answer is "yes" ('Y'), then the pitch value confidence flag fp is set (block 180) to reflect high confidence in the pitch value assigned in block 176. Processing then continues at point "D" (Figure 5G), as previously explained.
If the block 174 test determines that about one pitch cycle was not expected, then processing continues at block 182 by assigning the parameter np3 a value equal to the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3. Six of the ry(k) peaks depicted in Figure 4C exceed PEAK_THRESH3, so np3=6 for the example shown in Figure 4C. ip3 is assigned (block 182) a vector value equal to the positions of those signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3. The six aforementioned peaks depicted in Figure 4C occur at k=9, k=56, k=104, k=150, k=197 and k=244 respectively, so ip3=(9, 56, 104, 150, 197, 244) for the example shown in Figure 4C. The peak-to-peak interval width(s) p3(k) are determined (block 184) for all adjacent signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3. Then, the variations Δp3(l) between those interval width(s) are determined for all such adjacent signal peaks in ry(k). For the example shown in Figure 4C, p3=(47, 48, 46, 47, 47) and Δp3=(1, -2, 1, 0) respectively.
Following determination of p3(k) and Δp3(l) in block 184, a test is performed (block 186) to determine whether the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3, less the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH1, exceeds 1 (i.e. np3-np1 > 1). If the answer is "no" ('N'), then processing continues at point "H" (Figure 5E), as hereinafter explained. If the answer is "yes" ('Y'), then a further test is performed (block 188) to determine whether the maximum variation in interval width Δp3 is less than or equal to the maximum allowable pitch variation between successive cycles (i.e. DPMAX). If the answer is "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 190) having a pitch value p(m) equal to the average peak-to-peak interval width p3 for all adjacent signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3. Processing then continues at point "D" (Figure 5G), as previously explained.
If the block 188 test determines that the maximum variation in interval width Δp3 is not less than or equal to the maximum allowable pitch variation between successive cycles then the pitch values determined in respect of all signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3 are examined (block 192) to identify any case in which the absolute value of the difference between any two such pitch values is less than the value of the variable DPMAX. A test (block 194) is then performed to determine whether any such case has been identified. If the answer is "no" ('N'), then the current frame is characterized (block 196) as unvoiced by zeroing the voicing value v(m) and the pitch value p(m) and then continuing processing at point "D" (Figure 5G), as previously explained. If the answer is "yes" ('Y'), then λp is assigned (block 198) as the ratio of the pitch value defined by the peak-to-peak interval width between the first two adjacent signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH1, to the average of the two pitch values identified in block 192. A test (block 200) is then performed to determine whether the difference between λp and the nearest integer is less than 0.1. If the answer is "no" ('N'), then the current frame is characterized as unvoiced (block 202), by zeroing the voicing value v(m) and the pitch value p(m). If the answer is "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 204) having a pitch value p(m) equal to the average of the two pitch values identified in block 192. Processing then continues at point "D" (Figure 5G), as previously explained.
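The near-integer ratio test of blocks 192 to 204 can be sketched as follows. The text does not say which pair is used when several pairs of pitch values lie within DPMAX of each other; the sketch assumes the first such pair in order of occurrence. The function and argument names are illustrative.

```python
from itertools import combinations


def pitch_from_consistent_pair(widths_t3, first_width_t1, dpmax):
    """Return (v, p) following blocks 192 to 204.

    widths_t3      -- interval widths from peaks exceeding PEAK_THRESH3
    first_width_t1 -- width between the first two peaks exceeding PEAK_THRESH1
    dpmax          -- current value of DPMAX
    """
    for a, b in combinations(widths_t3, 2):         # block 192
        if abs(a - b) < dpmax:                      # block 194 answered "yes"
            candidate = (a + b) / 2
            ratio = first_width_t1 / candidate      # block 198: lambda_p
            if abs(ratio - round(ratio)) < 0.1:     # block 200
                return 1, candidate                 # block 204: voiced
            return 0, 0                             # block 202: unvoiced
    return 0, 0                                     # block 196: unvoiced
```

As an example, widths of (47, 95, 48) contain the consistent pair 47 and 48; if the first PEAK_THRESH1 interval is 95 samples, the ratio is 2.0 and the frame is voiced with a pitch value of 47.5.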
Returning now to block 186, assume that the test performed in that block reveals that the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH3, less the number of signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH1, does not exceed 1. In this case, processing continues at point "H" (Figure 5E) with a further test (block 206) to determine whether more than two signal peaks in ry(k) have a magnitude exceeding PEAK_THRESH3 (i.e. np3>2). If the answer is "no" ('N'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 208) having a pitch value p(m) determined by the peak-to-peak interval width between the only two adjacent signal peaks in ry(k) having a magnitude exceeding PEAK_THRESH1. If the answer is "yes" ('Y'), then the maximum and minimum interval width values max(p3), min(p3) are saved as p3max, p3min respectively (block 210).
Peaks exceeding the PEAK_THRESH3 threshold may not be reliable indicators of pitch since the PEAK_THRESH3 threshold is relatively low. Further testing is required to verify that large peak-to-peak intervals identified via the PEAK_THRESH3 threshold are reliable indicators of pitch. A test is accordingly performed (block 212) to determine whether any sub-multiple of the maximum interval width value, less the minimum interval width value, is less than or equal to DPMAX2. If the answer is "no" ('N'), the intervals identified via the PEAK_THRESH3 threshold are not pitch intervals. The current frame is therefore characterized as unvoiced (block 214) by zeroing the voicing value v(m) and the pitch value p(m). Processing then continues at point "D" (Figure 5G), as previously explained. If the answer is "yes" ('Y'), then the large interval identified via the PEAK_THRESH3 threshold most probably is a pitch multiple. The current frame is therefore characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 216) having a pitch value p(m) equal to the weighted average of the largest and the smallest intervals, p3max and p3min. A test is then performed (block 218) to determine whether exactly two signal peaks in ry(k) have a magnitude exceeding PEAK_THRESH3 (i.e. np3=2). If the answer is "no" ('N'), then processing continues at point "D" (Figure 5G), as previously explained. However, if the answer is "yes" ('Y'), then the pitch value confidence flag fp is set (block 220) to indicate a high degree of confidence in the reliability of the pitch value assigned in block 216. Processing then continues at point "D" (Figure 5G), as previously explained.
Returning now to Figure 5G, assume that processing at block 58 reveals that the current frame has already been characterized as unvoiced. Processing accordingly continues at block 222 by computing the energy function e(k) over the current frame in accordance with the formula provided in block 222. A test (block 224) is then performed to determine whether one or more signal peaks in ry(k) have a magnitude exceeding PEAK_THRESH1 (i.e. np1 ≥ 1). If the answer is "no" ('N'), then processing continues at point "J" (Figure 5H), as hereinafter explained.
If the block 224 test's answer is "yes" ('Y'), then maximum and minimum values of e(k) are determined for the first (i.e. "left") and second (i.e. "right") halves of the current frame and stored in the variables elmax, elmin, ermax and ermin respectively (block 226). The ratio of left half maximum and minimum values is then defined as μl (block 228), with μr defining the ratio of right half maximum and minimum values. A test (block 230) is then performed to determine whether μl is less than μr. If the answer is "yes" ('Y'), then bμ is defined (block 232) as μr/μl; otherwise bμ is defined (block 234) as μl/μr. Processing then continues at point "I" (Figure 5H).
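The half-frame energy ratios of blocks 226 to 234 can be computed as shown below. This is a sketch under stated assumptions: the energy function e(k) is supplied by the caller as a list of strictly positive values (its formula is given only in block 222 of the drawings), the frame is split at its midpoint, and bμ is read as the larger of the two ratios divided by the smaller. The function name is illustrative.

```python
def energy_ratios(e):
    """Return (mu_l, mu_r, b_mu) for an energy sequence e(k) over one frame.

    e must contain at least two strictly positive values so that each half
    is non-empty and has a non-zero minimum.
    """
    half = len(e) // 2
    left, right = e[:half], e[half:]
    mu_l = max(left) / min(left)        # block 228: left-half dynamic range
    mu_r = max(right) / min(right)      # right-half dynamic range
    # Blocks 230 to 234: ratio of the larger range to the smaller one.
    b_mu = mu_r / mu_l if mu_l < mu_r else mu_l / mu_r
    return mu_l, mu_r, b_mu
```

A frame whose two halves have similar dynamic range yields a bμ close to 1, whereas a frame with one quiet half and one energetic half yields a large bμ.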
At point "I" (Figure 5H), a test (block 236) is performed to determine whether both μl and μr exceed 215 and bμ is less than 25. If the answer is "no" ('N'), then processing continues at block 244, as hereinafter explained. If the answer is "yes" ('Y'), then the current frame is characterized as voiced (i.e. by setting the voicing value v(m)=1 as indicated in block 238) having a pitch value p(m) equal to the last interval detected from peaks in ry(k) whose magnitude exceeds PEAK_THRESH1. A further test (block 240) is then performed to determine whether the pitch value p(m) is outside the allowable pitch value range defined by the MINP and MAXP parameters. If the answer is "no" ('N'), then processing continues at block 244, as hereinafter explained. If the answer is "yes" ('Y'), then such out-of-range pitch values are ignored by characterizing the current frame as unvoiced (block 242), by zeroing the voicing value v(m) and the pitch value p(m).
Processing then continues at block 244 (which is also reached when processing continues at point "J", as previously mentioned) by determining the maximum value smax attained by the speech signal s(k) within two sub-frames centred on the current frame. (As indicated in the above table, the sub-frame length is 80 samples and the frame length is 320 samples. Accordingly, two sub-frames centred on the current frame comprise 160 samples, with 80 samples on either side of the centre of the frame.) A test (block 246) is performed to determine whether smax exceeds the maximum allowable signal magnitude for unvoiced sounds (MAX_UVLEVEL) and is also lower than the maximum allowable signal magnitude for voiced sounds (MAX_VLEVEL). If the answer is "yes" ('Y'), a further test (block 248) is performed to determine whether more than one peak in ry(k) exceeds PEAK_THRESH1 in magnitude.
If the block 246 test is answered "no" ('N'), or if the block 248 test is answered "yes" ('Y'), then processing continues at point "C" (Figure 5I), as previously explained. If the block 248 test is answered "no" ('N'), then maximum and minimum values of e(k) are determined for the current frame and stored in the variables emax, emin respectively (block 250), with the ratio emax/emin being defined as μ. Processing then continues at point "K" (Figure 5I).
At point "K" (Figure 5I), a test (block 252) is performed to determine whether μ exceeds 215. If the answer is "no" ('N'), then the current frame is characterized as unvoiced (block 254), by zeroing the voicing value v(m); and, the special voicing flag f_sv is cleared (i.e. f_sv = 0). Processing then continues at block 60 (point "C"), as previously explained. If the answer is "yes" ('Y'), then the current frame is characterized as voiced (block 256) by setting the voicing value v(m) = 1, with the pitch value p(m) set equal to the maximum allowable value MAXP, and the special voicing flag f_sv is set to indicate that the voicing decision was made under special circumstances. No pitch interval value is available at this point, since only one peak was detected. But the signal level s_max is higher than the maximum allowed for an unvoiced frame. Also, the signal has sufficient dynamic range, as the ratio of maximum to minimum energy levels (i.e. μ) is high. Hence the pitch value is set to the maximum allowable value MAXP and the special voicing flag is set. Processing then continues at block 60 (point "C"), as previously explained.
Returning now to Figure 5K, assume that processing at block 102 reveals that the current frame's voicing characterization is not the same as that of the immediately preceding frame. A further test is then performed (block 258) to determine whether the current frame is characterized as voiced. If the answer is "no" ('N'), then the immediately preceding frame is re-characterized as unvoiced (block 260) by zeroing its voicing and pitch values (i.e. v(m-1) = 0, p(m-1) = 0), the value of the variable pv_ is reset to that of pv_, and the voicing transient flag f_transient is cleared. If the answer is "yes" ('Y'), then a further test (block 262) is performed to determine whether the special voicing flag f_sv is cleared (i.e. f_sv = 0). If the block 262 test is answered "no" ('N'), then a still further test (block 264) is made to determine whether the absolute value of the difference between the current frame's pitch value (i.e. p(m)) and that of the frame which precedes the immediately preceding frame (i.e. p(m-2)) is less than 1.5 times the value of the variable DPMAX. If the block 264 test is answered "no" ('N'), then processing continues at point "O" (Figure 5L), as previously explained. If the block 264 test is answered "yes" ('Y'), then the immediately preceding frame is re-characterized as voiced (block 266) by setting its voicing value (i.e. v(m-1) = 1), with the pitch value p(m-1) being reset to the average of the pitch values of the current frame and that of the frame which precedes the immediately preceding frame. The value of the variable L_track is reset to that of L_track_, p_old is reset to p_old_, and the voicing transient flag f_transient is cleared. Processing then continues at point "O" (Figure 5L), as previously explained.
If the block 262 test is answered "yes" ('Y'), then the immediately preceding frame is re-characterized as voiced (block 268) by setting its voicing value (i.e. v(m-1) = 1), the value of the variable L_track is reset to that of L_track_, p_old is reset to p_old_, and the voicing transient flag f_transient is cleared. A test (block 270) is then performed to determine whether the variable p_old (which represents the average pitch value for an unbroken sequence of voiced frames) exceeds its initial value of zero. If the answer is "yes" ('Y'), then the pitch value p(m-1) of the immediately preceding frame is reset (block 272) to the value of p_old. Processing then continues at point "O" (Figure 5L), as previously explained. If the block 270 test is answered "no" ('N'), then the pitch value p(m-1) of the immediately preceding frame is reset (block 274) to the average of the pitch values of the current frame and that of the frame which precedes the immediately preceding frame. Processing then continues at point "O" (Figure 5L), as previously explained.
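The re-characterization of the immediately preceding frame (blocks 258 to 274) might be sketched as below. The sketch makes several assumptions: the DPMAX value is a hypothetical placeholder, the class and function names are invented for illustration, the saved copies of the tracking variables are modelled as separate fields, and the variable reset in block 260 that is illegible in the source text is not modelled.

```python
from dataclasses import dataclass
from typing import List

DPMAX = 10.0  # hypothetical placeholder; see the parameter table for the real value


@dataclass
class TrackState:
    f_sv: bool = False          # special voicing flag
    f_transient: bool = True    # voicing transient flag
    p_old: float = 0.0          # average pitch of the current voiced run
    p_old_saved: float = 0.0    # saved copy restored on re-characterization
    l_track: int = 0
    l_track_saved: int = 0


def _restore(st: TrackState) -> None:
    """Restore the tracking variables and clear the transient flag."""
    st.l_track = st.l_track_saved
    st.p_old = st.p_old_saved
    st.f_transient = False


def recheck_previous_frame(v: List[int], p: List[float], m: int,
                           st: TrackState) -> None:
    """Call only when v[m] != v[m-1] (block 102); requires m >= 2."""
    mean_pitch = 0.5 * (p[m] + p[m - 2])
    if not v[m]:                                    # block 258 "no" -> block 260
        v[m - 1], p[m - 1] = 0, 0.0
        st.f_transient = False
        return
    if st.f_sv:                                     # block 262 answered "no"
        if abs(p[m] - p[m - 2]) < 1.5 * DPMAX:      # block 264
            v[m - 1], p[m - 1] = 1, mean_pitch      # block 266
            _restore(st)
        return
    v[m - 1] = 1                                    # block 268
    _restore(st)
    # Blocks 270 to 274: prefer the running average if one exists.
    p[m - 1] = st.p_old if st.p_old > 0 else mean_pitch
```

For instance, with v = [1, 0, 1] and p = [50, 0, 54], a cleared special voicing flag and no saved running average, the middle frame is re-characterized as voiced with a pitch of 52.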
As will be apparent to those skilled in the art in the light of the foregoing disclosure, many alterations and modifications are possible in the practice of this invention without departing from the spirit or scope thereof. For example, instead of cubing the low pass filtered signal to enhance its peaks while preserving the negative-going signal portions, one could alternatively use a fifth, seventh or other odd power version of the low pass filtered signal; or, another non-linear expanding function. However, such alternatives would normally require additional processing without significantly improving signal peak detection capability. Persons skilled in the art will also appreciate that the methodology of the invention may be applied not only to speech signals, but also to "residual" signals (i.e. filtered speech signals which have been further processed by an LPC analysis filter), and also to a combination of such speech and residual signals. Further, persons skilled in the art will understand that the various parameters tabulated above can be altered to adapt the methodology of the invention to suit particular performance needs. Accordingly, the scope of the invention is to be construed in accordance with the substance defined by the following claims.
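The remark about odd powers can be illustrated with a short sketch. The function name and sample values are hypothetical; the point is only that an odd exponent keeps the sign of each sample while enlarging the gap between large and small magnitudes, whereas an even exponent would fold negative-going portions onto positive ones.

```python
from typing import List, Sequence


def expand(x: Sequence[float], power: int = 3) -> List[float]:
    """Magnitude-expand a signal with an odd power so that the sign of
    every sample is preserved (power=3 corresponds to cubing)."""
    if power % 2 == 0:
        raise ValueError("an even power would discard the sign of the signal")
    return [v ** power for v in x]


samples = [-2, 0.5, 3]
cubed = expand(samples)        # [-8, 0.125, 27]
fifth = expand(samples, 5)     # [-32, 0.03125, 243]
```

In this toy input the largest-to-smallest magnitude ratio grows from 6 before expansion to 216 after cubing, which is the peak-enhancing effect the description relies on; the fifth power widens it further at the cost of extra multiplications.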

Claims

WHAT IS CLAIMED IS:
1. A method of transforming a speech signal segment s(n) into a signal r_xy(k) having a plurality of substantially equal magnitude peaks with each adjacent pair of said peaks separated by a distance corresponding to one pitch cycle of s(n) if s(n) is voiced, said method characterized by:
(a) filtering s(n) to remove high frequency signal components therefrom and to produce a filtered replica thereof (304);
(b) magnitude expanding (304) said filtered replica to produce a magnitude expanded and filtered signal x(n);
(c) locating (308) a largest magnitude signal peak within x(n);
(d) deriving (322) a template y(n) comprising a portion of x(n) containing said largest magnitude signal peak; and,
(e) cross-correlating (324) y(n) across x(n) to produce r_xy(k).
2. A method as defined in claim 1, wherein said filtering is further characterized by low pass filtering s(n).
3. A method as defined in claim 2, wherein said magnitude expanding is further characterized by cubing said filtered replica.
4. A method of estimating a speech sound voicing parameter v(m) and a speech sound pitch parameter p(m) characterizing a speech signal s(n), said method characterized by:
(a) filtering s(n) to remove high frequency signal components therefrom and to produce a filtered replica thereof (18);
(b) magnitude expanding (18) said filtered replica to produce a magnitude expanded and filtered signal x(n);
(c) locating (22) a largest magnitude signal peak within x(n);
(d) deriving (32) a template y(n) comprising a portion of x(n) containing said largest magnitude signal peak;
(e) cross-correlating (34) y(n) across x(n) to produce r_xy(k);
(f) comparing (36) r_xy(k) and a predefined first peak detection threshold value to locate, within r_xy(k), a set of first signal peaks having a peak magnitude value exceeding said first peak threshold detection value;
(g) determining (42) a set of first peak separation distances between each adjacent pair of said first signal peaks; and,
(h) if more than two of said first signal peaks are located, and if said first peak separation distances are substantially equal, setting p(m) equal to an average of said first peak separation distances and setting v(m) to indicate that s(n) is voiced (48).
5. A method as defined in claim 4, further characterized by, if more than two of said first signal peaks are located, and if said first peak separation distances are not substantially equal:
(a) comparing (156) r_xy(k) and a predefined third peak detection threshold value to locate, within r_xy(k), a set of third signal peaks having a peak magnitude value exceeding said third peak threshold detection value; and,
(b) if equal numbers of said first and said third signal peaks are located, and if each one of said first signal peaks coincides in position with one of said third signal peaks, setting p(m) equal to said first peak separation distance between a most recently located and adjacent pair of said first signal peaks and setting v(m) to indicate that s(n) is voiced (164).
6. A method as defined in claim 5, further characterized by, if equal numbers of said first and said third signal peaks are not located, or if each one of said first signal peaks does not coincide in position with one of said third signal peaks:
(a) comparing (168) a largest one of said first peak separation distances with a smallest one of said first peak separation distances; and,
(b) if said largest one of said first peak separation distances is substantially equal to an integer multiple of said smallest one of said first peak separation distances, setting p(m) equal to a weighted average of said largest and said smallest ones of said first peak separation distances and setting v(m) to indicate that s(n) is voiced (172).
7. A method as defined in claim 6, further characterized by if said largest one of said first peak separation distances is not substantially equal to said integer multiple of said smallest one of said first peak separation distances, setting v(m) to indicate that s(n) is not voiced (170).
8. A method as defined in claim 4, further characterized by, if exactly two of said first signal peaks are located:
(a) comparing (134) r_xy(k) and a predefined second peak detection threshold value to locate, within r_xy(k), a set of second signal peaks having a peak magnitude value exceeding said second peak threshold detection value;
(b) determining (136) a set of second peak separation distances between each adjacent pair of said second signal peaks; and,
(c) if more than two of said second signal peaks are located, and if said second peak separation distances are substantially equal, setting p(m) equal to an average of said second peak separation distances and setting v(m) to indicate that s(n) is voiced (142).
9. A method as defined in claim 8, further characterized by, if more than two of said second signal peaks are located, and if said second peak separation distances are not substantially equal:
(a) comparing (150) a largest one of said second peak separation distances with a smallest one of said second peak separation distances; and,
(b) if said largest one of said second peak separation distances is substantially equal to an integer multiple of said smallest one of said second peak separation distances, setting p(m) equal to a weighted average of said largest and said smallest ones of said second peak separation distances and setting v(m) to indicate that s(n) is voiced (154).
10. A method as defined in claim 9, further characterized by if said largest one of said second peak separation distances is not substantially equal to an integer multiple of said smallest one of said second peak separation distances, setting v(m) to indicate that s(n) is not voiced (152).
11. A method as defined in claim 8, further characterized by if exactly two of said first signal peaks are located and if exactly two of said second signal peaks are located and if values of p(m) determined in respect of voiced speech frames which immediately precede a current speech frame signify that said current speech frame contains exactly two signal peaks, setting p(m) equal to said separation distance between said exactly two of said second signal peaks and setting v(m) to indicate that s(n) is voiced (176).
12. A method as defined in claim 4, further characterized by, if less than two of said first signal peaks are located, setting v(m) to indicate that s(n) is not voiced (40).
13. A method as defined in claim 11, further characterized by if said values of p(m) determined in respect of said voiced speech frames which immediately precede said current speech frame signify that said current speech frame contains more than two signal peaks:
(a) comparing (182) r_xy(k) and a predefined third peak detection threshold value to locate, within r_xy(k), a set of third signal peaks having a peak magnitude value exceeding said third peak threshold detection value;
(b) determining (184) a set of third peak separation distances between each adjacent pair of said third signal peaks; and,
(c) if the number of said third signal peaks exceeds the number of said first signal peaks by more than one and if said third peak separation distances are substantially equal, setting p(m) equal to an average of said third peak separation distances and setting v(m) to indicate that s(n) is voiced (190).
14. A method as defined in claim 13, further characterized by if the number of said third signal peaks does not exceed the number of said first signal peaks by more than one and if the number of said third signal peaks does not exceed two, setting p(m) equal to said separation distance between said exactly two of said first signal peaks and setting v(m) to indicate that s(n) is voiced (208).
15. A method as defined in claim 13, further characterized by if the number of said third signal peaks does not exceed the number of said first signal peaks by more than one and if the number of said third signal peaks exceeds two:
(a) comparing (212) a largest one of said third peak separation distances with a smallest one of said third peak separation distances; and,
(b) if said largest one of said third peak separation distances is substantially equal to an integer multiple of said smallest one of said third peak separation distances, setting p(m) equal to a weighted average of said largest and said smallest ones of said third peak separation distances and setting v(m) to indicate that s(n) is voiced (216).
16. A method as defined in claim 15, further characterized by if said largest one of said third peak separation distances is not substantially equal to said integer multiple of said smallest one of said third peak separation distances, setting v(m) to indicate that s(n) is not voiced (214).
17. A method as defined in claim 13, further characterized by if said third peak separation distances are not substantially equal, and if at least two of said third peak separation distances are not approximately equal, setting v(m) to indicate that s(n) is not voiced (196).
18. A method as defined in claim 13, further characterized by if said third peak separation distances are not substantially equal, and if at least two of said third peak separation distances are approximately equal, and if said approximately equal third peak separation distances are not substantially equal to said first peak separation distance, setting v(m) to indicate that s(n) is not voiced (202).
19. A method as defined in claim 13, further characterized by if said third peak separation distances are not substantially equal, and if at least two of said third peak separation distances are approximately equal, and if said approximately equal third peak separation distances are substantially equal to said first peak separation distance, setting p(m) equal to an average of said approximately equal third peak separation distances and setting v(m) to indicate that s(n) is voiced (204).
20. An electronic signal in a low bit rate speech coder, said signal characterized by a filtered, magnitude expanded and transformed replica r_xy(k) of a speech signal s(n), said replica having a plurality of substantially equal magnitude peaks with each adjacent pair of said peaks separated by a distance corresponding to one pitch cycle of s(n) if s(n) is voiced, said replica for use by said speech coder to derive a speech sound voicing parameter v(m) characterizing s(n) and a speech sound pitch parameter p(m) characterizing s(n).
21. An electronic signal as defined in claim 20, wherein r_xy(k) is derived by correlating a filtered and expanded replica x(n) of s(n) and a template y(n) comprising a portion of x(n) containing a largest magnitude signal peak of x(n).
22. An electronic signal as defined in claim 21, wherein said correlating further comprises cross-correlating yin) across xin).
23. An electronic signal as defined in claim 22, wherein x(n) is further characterized by a cubed replica of s(n).
24. An electronic signal as defined in claim 23, wherein x(n) is further characterized by a low pass filtered replica of s(n).
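By way of illustration only, the steps recited in claims 1 to 4 can be sketched in a few lines of Python. Everything numeric here is an assumption made for the example and is not taken from the specification: the 3-tap moving average stands in for the low pass filter, the template half-width of 8 samples, the relative peak threshold of 0.6 and the 2-sample tolerance for "substantially equal" are hypothetical, and the helper `pulse_train` merely builds a synthetic test signal. Branches covered by the other claims (for example exactly two peaks) are not modelled and simply return an unvoiced result.

```python
from typing import List, Sequence, Tuple


def low_pass(s: Sequence[float], taps: int = 3) -> List[float]:
    """Stand-in low pass filter: zero-padded moving average (assumption)."""
    half = taps // 2
    return [sum(s[max(0, n - half): n + half + 1]) / taps for n in range(len(s))]


def transform(s: Sequence[float], half_width: int = 8) -> List[float]:
    """Steps (a) to (e): filter, cube, take a template around the largest
    peak of x(n), and slide it across x(n) to obtain r_xy(k)."""
    x = [v ** 3 for v in low_pass(s)]
    c = max(range(len(x)), key=lambda i: abs(x[i]))
    y = x[max(0, c - half_width): c + half_width + 1]
    return [sum(yj * x[k + j] for j, yj in enumerate(y))
            for k in range(len(x) - len(y) + 1)]


def estimate(s: Sequence[float], peak_thresh: float = 0.6,
             tol: int = 2) -> Tuple[int, float]:
    """Steps (f) to (h): return (v, p) for one segment."""
    r = transform(s)
    r_max = max(r)
    if r_max <= 0:
        return 0, 0.0
    # Local maxima of r_xy(k) above a threshold relative to the largest value.
    peaks = [k for k in range(1, len(r) - 1)
             if r[k] > peak_thresh * r_max and r[k] > r[k - 1] and r[k] >= r[k + 1]]
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    if len(peaks) > 2 and max(gaps) - min(gaps) <= tol:
        return 1, sum(gaps) / len(gaps)
    return 0, 0.0  # other branches of the method are not modelled here


def pulse_train(centres: Sequence[int], length: int = 320) -> List[float]:
    """Hypothetical test signal: triangular pulses at the given positions."""
    s = [0.0] * length
    for c in centres:
        for d in range(-3, 4):
            s[c + d] = 4.0 - abs(d)
    return s
```

Applied to `pulse_train([20, 70, 120, 170, 220, 270])`, the sketch reports a voiced segment with a pitch of 50 samples, while a segment containing a single pulse is reported as unvoiced.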
PCT/CA2000/000364 1999-08-17 2000-04-03 Pitch and voicing estimation for low bit rate speech coders WO2001013360A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU36512/00A AU3651200A (en) 1999-08-17 2000-04-03 Pitch and voicing estimation for low bit rate speech coders

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US37559199A 1999-08-17 1999-08-17
US09/375,591 1999-08-17

Publications (1)

Publication Number Publication Date
WO2001013360A1 true WO2001013360A1 (en) 2001-02-22

Family

ID=23481487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2000/000364 WO2001013360A1 (en) 1999-08-17 2000-04-03 Pitch and voicing estimation for low bit rate speech coders

Country Status (2)

Country Link
AU (1) AU3651200A (en)
WO (1) WO2001013360A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996036041A2 (en) * 1995-05-10 1996-11-14 Philips Electronics N.V. Transmission system and method for encoding speech with improved pitch detection
WO1999010879A1 (en) * 1997-08-25 1999-03-04 Telefonaktiebolaget Lm Ericsson Waveform-based periodicity detector

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ROUAT J ET AL: "A pitch determination and voiced/unvoiced decision algorithm for noisy speech", SPEECH COMMUNICATION,NL,ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, vol. 21, no. 3, 1 April 1997 (1997-04-01), pages 191 - 207, XP004059542, ISSN: 0167-6393 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1566796A2 (en) * 2004-02-20 2005-08-24 Sony Corporation Method and apparatus for separating a sound-source signal and method and device for detecting pitch
EP1566796A3 (en) * 2004-02-20 2005-10-26 Sony Corporation Method and apparatus for separating a sound-source signal and method and device for detecting pitch
EP1755111A1 (en) * 2004-02-20 2007-02-21 Sony Corporation Method and device for detecting pitch
CN100356445C (en) * 2004-02-20 2007-12-19 索尼株式会社 Method and apparatus for separating sound-source signal and method and device for detecting pitch
US8073145B2 (en) 2004-02-20 2011-12-06 Sony Corporation Method and apparatus for separating sound-source signal and method and device for detecting pitch
US10482892B2 (en) 2011-12-21 2019-11-19 Huawei Technologies Co., Ltd. Very short pitch detection and coding
US11270716B2 (en) 2011-12-21 2022-03-08 Huawei Technologies Co., Ltd. Very short pitch detection and coding
US11894007B2 (en) 2011-12-21 2024-02-06 Huawei Technologies Co., Ltd. Very short pitch detection and coding
US10249315B2 (en) 2012-05-18 2019-04-02 Huawei Technologies Co., Ltd. Method and apparatus for detecting correctness of pitch period
US10984813B2 (en) 2012-05-18 2021-04-20 Huawei Technologies Co., Ltd. Method and apparatus for detecting correctness of pitch period
US11741980B2 (en) 2012-05-18 2023-08-29 Huawei Technologies Co., Ltd. Method and apparatus for detecting correctness of pitch period
CN108470564A (en) * 2018-04-03 2018-08-31 苏州欧孚网络科技股份有限公司 According to the artificial intelligence approach of audio identification personality characteristics

Also Published As

Publication number Publication date
AU3651200A (en) 2001-03-13

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP