GB2400003A

GB2400003A - Pitch estimation within a speech signal

Info

Publication number: GB2400003A
Application number: GB0306669A
Authority: GB
Inventors: Halil Fikretler; Jonathan Gibbs
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 2003-03-22
Filing date: 2003-03-22
Publication date: 2004-09-29
Anticipated expiration: 2023-03-22
Also published as: GB0306669D0; GB2400003B

Abstract

A method of sample-by-sample pitch estimation within a speech signal 100, comprising a) selecting sample points in the speech signal as candidate positions 102-104, b) estimating candidate pitches at each of the plurality of candidate positions 118-136, c) refining candidate pitches to sub-integer pitch estimates, d) selecting from among said candidate pitches at each of the plurality of candidate positions, e) interpolating between pitches selected at each of the plurality of candidate positions (162-166, fig 2). The method is characterised by the use of linear predictive coding inverse filtration of the speech signal 102, peaks of the resulting filtered speech signal being selected as the candidate positions. The method may include the use of reliability indicators and thresholds to aid selection of suitable peaks. Both low pass and band pass filtration may be used, the signal may be squared to emphasise peaks and both correlation and autocorrelation may be used to identify pitch peaks within the input speech signal. Further optional features include using a cubic b-spline to estimate non-integer peak positions, and interpolating using a polynomial function.

Description

Pitch Estimation Within A Speech Signal

Technical Field

This invention relates to speech parameterisation in communications systems, in particular pitch estimation within a speech signal. The invention is applicable to, but not limited to, wideband speech coding using prototype waveform interpolation (PWI) based techniques.

Background

Many present day speech communications systems operating according to industry defined standards, such as the global system for mobile communications (GSM) cellular telephony standard and the TErrestrial Trunked RAdio (TETRA) system for private mobile radio users, use speech-processing units to encode and decode speech patterns. In such voice communications systems a speech encoder in a transmitting unit converts the analogue speech pattern into a suitable digital format for transmission. A speech decoder in a receiving unit converts a received digital speech signal into an audible analogue speech pattern. Thus, a digital speech communication system typically uses a speech encoder to produce a parsimonious representation of the analogue speech signal. A corresponding decoder is used to generate an approximation to the speech signal from that representation. The combination of the encoder and decoder is known in the art as a speech codec.

As frequency spectrum for such voice communication systems is a valuable resource, it is desirable to limit the channel bandwidth used by such speech signals, in order to maximise the number of users per frequency band. Hence, a primary objective in the use of speech coding techniques is to reduce the occupied capacity of the speech patterns as much as possible, by use of compression techniques, without losing fidelity. One such technique, waveform interpolation, will now be discussed.

As will be apparent to a person skilled in the art, many segments of speech signals contain quasi-periodic waveforms. In particular, it is well known that during voiced segments the speech signal is nearly periodic.

Thus, when commencing from any time instant, it is relatively easy to identify a first pitch cycle, a second pitch cycle and so on. Notably, when comparing a sequence of such pitch cycle waveforms, it is possible to observe that the general shape of the waveform evolves over time.

This slow evolution of the waveform over time led some researchers to extract a pitch cycle waveform at regular time intervals and obtain a good approximation of the intermediate pitch cycle by means of interpolation. This approach led to the speech-processing concept of waveform interpolation (WI).

For voiced speech, the pitch cycle waveform effectively describes the essential characteristics of the speech signal. Thus, a speech signal can be re-constructed, without distortion, if the pitch cycle waveform and the phase are known at each instance of time. Although such a technique lends itself readily to voiced signals, it is also applicable to non-periodic unvoiced signals. However, whilst the pitch cycle waveform evolves slowly for voiced speech, i.e. there is a slow rate of change of repetitive components such as pitch and harmonic components of pitch, it evolves rapidly for unvoiced speech. Hence, the quasi-periodic component of the speech signal corresponds to a slowly evolving waveform (SEW) component, whereas a non-periodic ("noisy") component of the speech signal corresponds to a rapidly evolving waveform (REW) component.

The inventors of the present invention have recognised and appreciated significant limitations in the use of Waveform Interpolation, particularly for high bit-rate applications.

In particular, an important element of a speech codec is the approach it takes to reconstruct consecutive cycles of quasi-periodic waveforms. Frequently, correlation is exploited by transmitting a single cycle of the waveform, or of a filtered version of the waveform, only once every 20-30 msec. In this manner, a portion of the data is missing in the received signal. The standard approach to deal with the missing data in the decoder is to linearly interpolate between samples of the transmitted cycles.

In practice, the use of linear interpolation by a speech decoder to generate data between the transmitted cycles only produces an adequate approximation to the speech signal if the speech signal really is quasi-periodic, or, equivalently, if the vectors representing consecutive cycles of the waveform evolve sufficiently slowly.

However, many segments of speech contain noisy non-periodic signal components. This results in comparatively rapid evolution of the waveform cycles. In order for waveform interpolation in an encoder to be useful for such signals, it is necessary to extract accurately a sufficiently quasi-periodic component from the noisy signal in the encoder.

Several possible methods to extract a quasi-periodic component from the noisy signal exist:

i) Linear low pass filtering a sequence of vectors representing consecutive cycles of speech in the time dimension using finite impulse response (FIR) filters is well known in the speech coding literature. The difficulty with this approach is that in order to get good separation of the slowly and rapidly evolving components, the low pass filter frequency response must have a sharp tail or roll-off. This requires a long impulse response, which necessitates an undesirably large filter delay. Hence, the FIR approach is of limited practical use in a wideband speech coding context for interactive conversational applications.

ii) A Kalman filter technique for estimating the quasi-periodic signal components has been described by Gruber and Todtli (IEEE Trans Signal Processing, Vol. 42, No. 3, March 1994, pp 552-562). However, because this Kalman filter technique is based on a linear dynamic system model of a frequency domain representation of the signal, it is unnecessarily complex. It also assumes that the dynamic system model parameters (i.e. noise energy and the harmonic signal gain) are known. However, when considering speech coding, noise energy and the harmonic signal gain parameters are not known. In its favour, the Kalman filtering method has promise in that the quasi-periodic and non-periodic parts of the speech waveform can be estimated with little or well defined delay.

To implement such a low delay form of the Kalman filter requires a low delay sample-by-sample pitch estimator.

Dynamic programming is widely used for sample-by-sample pitch estimation, as it is robust and capable of sub-integer resolution. However, it suffers from delay, typically requiring 60 milliseconds (approximately 3 frames) of look-ahead.

As a consequence, a need has arisen for an alternative low-delay sample-by-sample pitch estimation method.

Summary of the Invention

In accordance with a first aspect of the invention, there is provided a sample-by-sample pitch estimation method, as claimed in claim 1.

In accordance with a second aspect of the invention, there is provided a sample-by-sample pitch estimation apparatus, as claimed in claim 23.

Further features of the present invention are as claimed in the dependent claims.

Brief description of the drawings

An exemplary embodiment of the present invention will now be described, with reference to the accompanying drawings:

FIG. 1 is the first half of a flowchart of a pitch estimation method adapted to support the various inventive concepts of the present invention.

FIG. 2 is the second half of a flowchart of a pitch estimation method adapted to support the various inventive concepts of the present invention.

FIG. 3 is an apparatus for the implementation of a pitch estimation method adapted to support the various inventive concepts of the present invention.

FIG. 4A illustrates the selection of candidate peaks and correlation windows from the raw input.

FIG. 4B illustrates the raw input and corresponding sample-by-sample estimated pitch output.

Detailed description of the invention

In summary, the inventors of the present invention have proposed a pitch estimation method and implementation employing the following basic steps:

A. Select candidate positions within the input signal;

B. Estimate instantaneous pitches at said positions;

C. Refine to sub-integer pitch estimates;

D. Pick the pitch and position candidates to create a smooth pitch track; and

E. Interpolate between said candidate pitches and candidate positions.

Advantageously, the inventors of the present invention have developed a method of evaluating the consistency of pitch estimates that avoids long-delay methods such as dynamic programming. The method is based upon the selection of candidate positions from an LPC inverse-filtered speech signal, and the further selection and identification of harmonic families of possible pitches at each candidate position, the relative size of these families being the basis of the consistency evaluation. This method therefore only requires the information within the relevant speech frame to estimate the pitch track, with look-ahead and look-back frames only used if smooth interpolation between speech frames is desired.
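The LPC inverse filtering on which the selection of candidate positions relies can be illustrated with a minimal sketch. It assumes the autocorrelation method with a Levinson-Durbin recursion, which the text does not specify; the function names and the choice of prediction order are likewise illustrative assumptions rather than part of the described embodiment.

```python
def autocorrelation(signal, max_lag):
    """Unnormalised autocorrelation r[0..max_lag] of one frame."""
    n = len(signal)
    return [sum(signal[i] * signal[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion (assumed here; not specified in the text).

    Returns predictor coefficients a[1..order] such that
    s[n] is approximated by sum(a[k] * s[n - k]).
    """
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(signal, coeffs):
    """Inverse filter: e[n] = s[n] - sum(a[k] * s[n - k]).

    Samples before the start of the frame are treated as zero.
    """
    residual = []
    for n, sample in enumerate(signal):
        predicted = sum(c * signal[n - k]
                        for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(sample - predicted)
    return residual
```

For a signal produced by exciting a resonant filter with periodic pulses, the residual is close to the pulse train itself, which is why its peaks are useful markers of pitch cycles.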

In an embodiment of the invention, the input speech signal is presented as frames in the order of 20ms long. Full pitch analysis is performed on the frame prior to the current frame (the prior frame hereinafter referred to as the analysis frame). The current frame serves as a look-ahead buffer for the analysis frame. The output of N previous analyses is used as a look-back buffer, where N is typically 2.

Steps A-C are applied to both the analysis frame and look-ahead buffer. Steps D and E are applied to the analysis frame only, using information from the look-back buffer as required. The aim of the method is to generate a sample-by-sample estimate of a speech pitch in the signal, hereinafter referred to as a smooth pitch track.

In an embodiment of the invention, Step A comprises the following steps:

A(i) The input speech signal (402) is LPC inverse filtered.

- This enhances periodicity & harmonic structure.

A(ii) The residual speech signal produced by step A(i) is band-pass filtered, the pass range typically being in the order of 70-900 Hz.

- This enhances speech in noisy environments.

A(iii) The filtered speech signal is raised to a power, typically 2.

- This enhances speech peaks.

A(iv) This signal is low-pass filtered.

- The resulting smoothed filtered speech provides an envelope for the quasi-periodic part of the signal.

A(v) The maximum of the energy envelope is found.

Other peaks that are above a relative threshold to this maximum and are at least a minimum pitch value apart from neighbouring peak positions are then found.

A(vi) The peaks found in step A(v) are sorted by their positions.

A(vii) Peaks 30% smaller than both adjacent peaks or 70% smaller than either adjacent peak are eliminated.

- This helps to reduce halving errors.

The positions of the remaining peaks are hereinafter referred to as the 'candidate positions' (404).
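Steps A(v) to A(vii) can be sketched as follows, taking the low-pass filtered energy envelope of step A(iv) as given. This is an illustrative reading only: the relative threshold (0.2), the minimum spacing (20 samples) and the interpretation of "30% smaller" and "70% smaller" as falling below 0.7 and 0.3 times the neighbouring peak are assumptions, as are all function and parameter names.

```python
def local_maxima(envelope):
    """Indices of samples that rise above the previous sample and do not fall below the next."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] >= envelope[i + 1]]


def select_candidate_positions(envelope, rel_threshold=0.2, min_spacing=20):
    """Pick candidate positions from an energy envelope (sketch of A(v)-A(vii))."""
    peaks = local_maxima(envelope)
    if not peaks:
        return []
    top = max(envelope[i] for i in peaks)

    # A(v): start from the maximum, then accept other peaks that clear the
    # relative threshold and lie at least min_spacing samples from accepted peaks.
    kept = []
    for i in sorted(peaks, key=lambda i: envelope[i], reverse=True):
        if envelope[i] >= rel_threshold * top and all(
                abs(i - j) >= min_spacing for j in kept):
            kept.append(i)

    # A(vi): sort the surviving peaks by position.
    kept.sort()

    # A(vii): drop peaks much smaller than their neighbours (assumed factors).
    result = []
    for n, i in enumerate(kept):
        neighbours = []
        if n > 0:
            neighbours.append(envelope[kept[n - 1]])
        if n + 1 < len(kept):
            neighbours.append(envelope[kept[n + 1]])
        below_both = len(neighbours) == 2 and all(
            envelope[i] < 0.7 * v for v in neighbours)
        below_either = any(envelope[i] < 0.3 * v for v in neighbours)
        if not (below_both or below_either):
            result.append(i)
    return result
```

In this sketch the neighbour comparison of A(vii) is made once against the sorted list from A(vi), rather than repeated after each elimination; the text does not say which is intended.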

A(viii) A correlation window is then determined for each candidate position (406). The start and stop points for this window are set to be the middle points between candidate positions or the bounds of the analysis frame as appropriate.

- The correlation windows are used in step B.

The outputs of step A are thus a collection of candidate positions distributed over the time history of the input speech, together with corresponding correlation windows to be applied at each respective candidate position in step B.

In an embodiment of the invention, Step B comprises the following steps:

B(i) For each candidate position, forward and backward normalised correlations are calculated using the respective correlation window determined in step A(viii).

B(ii) For subsequent steps the backward normalised correlations are used in preference, unless the normalised correlation peak is not strong (typically less than 0.8) and the forward normalised correlation is higher (e.g. at speech onsets).

B(iii) For each candidate position, typically up to seven candidate peaks are selected from the correlation chosen in B(ii).

B(iv) For each candidate position the correlation chosen in step B(ii) is auto-correlated using a pitch-length window.

B(v) For each candidate position, typically up to three candidate peaks are selected from the resulting auto-correlation.

B(vi) For each candidate position, candidate peaks from the normalized correlation and the auto correlation (correlation of the normalised correlation) are compared. If candidate peaks from the auto-correlation are similar to those of the normalised correlation they are removed.

B(vii) Candidate peaks correspond to pitch frequencies.

For each candidate position the remaining candidate peaks are grouped in terms of harmonics of a given pitch frequency, by looking for approximate integer multiples of corresponding pitch.

B(viii) The power of each resulting group is weighted by taking the average of the power of each candidate peak in the group and multiplying by (group population) / (group population + 2).

- Groups containing a pitch with more identified harmonics will therefore gain a higher weighting.

B(ix) Groups that are 40% less powerful than the best group as calculated in step B(viii) are eliminated.

B(x) The remaining groups are taken to represent the candidate pitches. The reliability of each candidate pitch is dependent upon the difference in power as calculated in step B(viii) between that candidate pitch and the strongest candidate pitch. This information is used for consistency checking in step D.

The outputs of step B are thus harmonically grouped normalised correlations and correlations of correlations that embody candidate pitch frequencies, having an associated reliability indication, for each candidate position.
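The harmonic grouping and weighting of steps B(vii) to B(x) can be sketched as below. Candidate peaks are represented as hypothetical (lag, power) pairs, with lag in samples. The grouping tolerance, the rule of treating the shortest lag in a group as the fundamental, and the use of the ratio to the best group as the reliability figure are all assumptions made for illustration; the text only fixes the n / (n + 2) weighting and the 40% elimination rule.

```python
def group_harmonics(peaks, tol=0.05):
    """Group (lag, power) peaks whose lags are near-integer multiples of a base lag."""
    remaining = sorted(peaks)
    groups = []
    while remaining:
        base = remaining.pop(0)  # shortest unassigned lag starts a new group
        group, rest = [base], []
        for lag, power in remaining:
            ratio = lag / base[0]
            multiple = round(ratio)
            if multiple >= 2 and abs(ratio - multiple) <= tol * multiple:
                group.append((lag, power))
            else:
                rest.append((lag, power))
        groups.append(group)
        remaining = rest
    return groups


def group_weight(group):
    """B(viii): mean peak power multiplied by population / (population + 2)."""
    n = len(group)
    mean_power = sum(power for _, power in group) / n
    return mean_power * n / (n + 2)


def candidate_pitches(peaks, drop_fraction=0.4):
    """B(ix)-B(x): drop groups 40% weaker than the best, report the rest."""
    groups = group_harmonics(peaks)
    weights = [group_weight(g) for g in groups]
    best = max(weights)
    candidates = []
    for group, weight in zip(groups, weights):
        if weight >= (1.0 - drop_fraction) * best:
            candidates.append({
                "lag": group[0][0],            # base lag of the group
                "weight": weight,
                "reliability": weight / best,  # assumed reliability measure
            })
    return candidates
```

With peaks at lags 40, 80 and 120 plus an isolated peak at lag 55, the three harmonically related peaks form one group whose weighting exceeds that of the isolated peak, so only the lag-40 candidate survives.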

In an embodiment of the invention, Step C comprises the following steps:

C(i) A cubic B-spline is fitted to the candidate peaks selected in step B, enabling non-integer pitch correlations to be calculated using spline interpolation.

C(ii) Each fitted spline is used as a function input to an algorithm identifying local maxima (for example the Brent Algorithm), which outputs the non-integer maximum peak position.

The outputs of step C are thus non-integer refinements of the candidate pitch frequencies produced in step B.

Together, the outputs of Steps B and C provide, for each candidate position:

i) candidate non-integer pitch estimates; and

ii) an indication of the reliability of each candidate pitch estimate, dependent upon the number of harmonics associated with that candidate pitch estimate.
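The sub-integer refinement of step C can be illustrated with the sketch below. Two simplifications are assumed and should be read as such: the correlation samples are used directly as the control points of a uniform cubic B-spline (a smoothing spline, not an exact interpolating fit), and a plain golden-section search stands in for the Brent algorithm mentioned in the text. The function names are illustrative.

```python
import math


def bspline_value(control, x):
    """Evaluate a uniform cubic B-spline with the given control points at x.

    Valid for 1 <= x <= len(control) - 2.
    """
    i = min(max(int(math.floor(x)), 1), len(control) - 3)
    t = x - i
    p0, p1, p2, p3 = control[i - 1], control[i], control[i + 1], control[i + 2]
    return ((1 - t) ** 3 * p0
            + (3 * t ** 3 - 6 * t ** 2 + 4) * p1
            + (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) * p2
            + t ** 3 * p3) / 6.0


def golden_section_max(f, lo, hi, tol=1e-6):
    """Locate the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d  # maximum lies in [lo, d]
        else:
            lo = c  # maximum lies in [c, hi]
    return 0.5 * (lo + hi)


def refine_peak(correlation, peak_index):
    """Return a non-integer peak position near an integer correlation peak."""
    lo = max(peak_index - 1, 1)
    hi = min(peak_index + 1, len(correlation) - 2)
    return golden_section_max(lambda x: bspline_value(correlation, x), lo, hi)
```

For correlation samples drawn from a parabola peaking at lag 5.3, calling `refine_peak` with the integer peak index 5 recovers a position close to 5.3.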

In an embodiment of the invention, Step D comprises the following steps:

D(i) The overall most reliable candidate pitch estimate from each of the analysis frame and look-back buffer are selected as definite points in time and frequency (hereinafter referred to as 'reference points') through which an eventual pitch track will pass.

D(ii) If the respective pitch values of the reference points selected in step D(i) exceed a difference threshold but both have a high reliability indication, a pitch transition flag is set on.

D(iii) If the respective pitch values of the reference points selected in step D(i) exceed a difference threshold and one reference point has a low reliability indication, that one is removed unless an alternative candidate pitch from the same candidate position has a sufficiently close pitch value to the other reference point.

D(iv) For each candidate position not selected as a reference point, a comparison dependent upon both pitch value and reliability is made between each candidate pitch and one of the reference points, to select pitch values that form a smooth pitch track. For said comparison, candidate positions use the closest reference point unless positioned between two reference points, where the following applies: chronologically, candidate positions between the reference points switch from using the earlier reference point to the later reference point, when at a given candidate position the most reliable candidate pitch is found to be closer to the pitch value of the later reference point. If no candidate pitch value for a candidate position could contribute to a smooth pitch track, the candidate position is removed.

D(v) If the transition flag is on, the number of pitch candidates similar to each reference point is counted. If a reference point has no similar pitch candidates, it is removed, the transition flag is set off and step D(iv) is repeated with only one reference point.

The outputs of step D are thus a final non-integer pitch estimate at each of those candidate positions where such estimates could provide a smooth pitch track.
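A deliberately simplified sketch of the selection in step D is given below. It uses a single reference point only, so the transition flag and the two-reference-point logic of steps D(ii), D(iii) and D(v) are omitted. The data layout (a mapping from candidate position to a list of (pitch, reliability) pairs), the 15% closeness tolerance and the function name are assumptions for illustration.

```python
def select_pitch_track(candidates, tol=0.15):
    """Pick one pitch per candidate position around a single reference point.

    candidates: {position: [(pitch, reliability), ...]}
    Returns {position: pitch}, with positions removed where no candidate
    is close enough to the reference pitch.
    """
    # D(i), simplified: the most reliable candidate overall is the reference point.
    ref_pos, (ref_pitch, _) = max(
        ((pos, cand) for pos, cands in candidates.items() for cand in cands),
        key=lambda item: item[1][1])

    track = {ref_pos: ref_pitch}
    for pos, cands in candidates.items():
        if pos == ref_pos:
            continue
        # D(iv), simplified: keep candidates whose pitch is close to the
        # reference, then prefer the most reliable of those.
        close = [c for c in cands if abs(c[0] - ref_pitch) <= tol * ref_pitch]
        if close:
            track[pos] = max(close, key=lambda c: c[1])[0]
        # Otherwise the candidate position is removed from the track.
    return dict(sorted(track.items()))
```

In this sketch a position whose only candidate is, say, double the reference pitch is dropped, while a position offering both a doubled and a consistent candidate keeps the consistent one even if it is the less reliable of the two.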

In an embodiment of the invention, Step E comprises the following steps:

E(i) Candidate positions are shifted by half the pitch period predicted by their corresponding pitch estimate, the shift being forward if backward correlations were selected in step B(ii) for that candidate position, or backward if forward correlations were selected in step B(ii) for that candidate position.

E(ii) A polynomial is applied over all the candidate positions to fit to their pitch estimates. In a preferred embodiment, a Hermite polynomial is used to minimise estimation overshoot.

E(iii) A sample-by-sample pitch track is obtained by interpolation of the polynomial fit.

The output of step E is thus a sample-by-sample estimated pitch track (414) extending over the analysis frame and into the previous analysis frame of the look-back buffer, allowing for a smooth transition between subsequent frames of the pitch track.
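Step E can be sketched as follows, under stated assumptions: pitch estimates are expressed as periods in samples, and the Hermite polynomial is taken to be a piecewise cubic Hermite interpolant with harmonic-mean slopes (a monotone scheme chosen here because it does not overshoot the knot values; the text does not name a specific construction). Outside the range of the shifted positions the nearest pitch value is simply held. All names are illustrative.

```python
def shift_positions(positions, periods, used_backward):
    """E(i): shift by half a pitch period, forward if backward correlation was used."""
    return [pos + (0.5 * per if back else -0.5 * per)
            for pos, per, back in zip(positions, periods, used_backward)]


def hermite_slopes(x, y):
    """Knot slopes: harmonic mean of adjacent secants, zero at local extrema."""
    secants = [(y[k + 1] - y[k]) / (x[k + 1] - x[k]) for k in range(len(x) - 1)]
    slopes = [secants[0]]
    for k in range(1, len(x) - 1):
        a, b = secants[k - 1], secants[k]
        slopes.append(2.0 * a * b / (a + b) if a * b > 0 else 0.0)
    slopes.append(secants[-1])
    return slopes


def pitch_track(x, y, n_samples):
    """E(ii)-E(iii): sample-by-sample track from knots (x strictly increasing, len >= 2)."""
    slopes = hermite_slopes(x, y)
    track = []
    k = 0
    for n in range(n_samples):
        if n <= x[0]:
            track.append(y[0])   # hold first value before the first knot
            continue
        if n >= x[-1]:
            track.append(y[-1])  # hold last value after the last knot
            continue
        while n > x[k + 1]:
            k += 1
        h = x[k + 1] - x[k]
        t = (n - x[k]) / h
        h00 = 2 * t ** 3 - 3 * t ** 2 + 1
        h10 = t ** 3 - 2 * t ** 2 + t
        h01 = -2 * t ** 3 + 3 * t ** 2
        h11 = t ** 3 - t ** 2
        track.append(h00 * y[k] + h10 * h * slopes[k]
                     + h01 * y[k + 1] + h11 * h * slopes[k + 1])
    return track
```

A typical use would shift the positions first, for example `x = shift_positions(positions, periods, used_backward)`, and then call `pitch_track(x, periods, frame_length)`.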

Such a pitch estimation technique is suitable for the low- delay Kalman filtering method, although it will be clear to a person skilled in the art that such pitch estimation may be suitable in any situation requiring sample-by-sample pitch estimation, and particularly advantageous where delay is to be kept at a minimum.

Figure 3 illustrates an apparatus 300 for implementing an embodiment of the present invention, comprising a means for the implementation of step A (302), a means for the implementation of step B (304), a means for the implementation of step C (306), a means for the implementation of step D (310), a memory means (308) and a means for the implementation of step E (312). Note that it will be clear to a person skilled in the art that alternative implementations are possible, such as by use of a microprocessor, input and output means, a memory means and microprocessor-implementable instructions for the implementation of steps A-E.

Claims

1. A method of sample-by-sample pitch estimation within a speech signal, comprising steps A to E wherein

A) Selecting sample points in the speech signal as candidate positions;

B) Estimating candidate pitches at each of the plurality of said candidate positions;

C) Refining said candidate pitches to sub-integer pitch estimates;

D) Selecting from among said candidate pitches at each of the plurality of said candidate positions; and

E) Interpolating between pitches selected at each of the plurality of said candidate positions,

characterized by; during step A, selecting said candidate positions over a time history from peaks obtained using linear predictive coding inverse filtration of the speech signal.

2. A method according to claim 1, wherein the selection of candidate pitches includes defining a reliability for each candidate pitch at each candidate position according to the number of harmonics of said pitch identified by pitch estimation means at said candidate position.

3. A method according to any one of the preceding claims, wherein the signal is additionally band-pass filtered and low pass filtered following LPC inverse filtration and prior to the determination of selected sample points as candidate positions.

4. A method according to claim 3, wherein the signal is additionally raised to a power, typically 2.

5. A method according to any one of the preceding claims, wherein the determination of selected sample points as candidate positions includes the identification of the maximum value of the filtered signal, other peak values above a relative threshold to this maximum also being selected unless they are respectively; i) below a given threshold smaller than both adjacent peak values, typically 30%, ii) below a given threshold smaller than any adjacent peak value, typically 70%, or iii) less than a specified number of samples away from a larger peak value.

6. A method according to any one of the preceding claims, wherein a correlation window is determined for each candidate position, the start and stop points for each window being set to be the midway between the candidate position and respective adjacent candidate positions, or at the bounds of the analysis frame, as appropriate.

7. A method according to any one of the preceding claims, wherein for each candidate position backward and forward normalised correlations are generated for selection, defaulting to the selection of the backward correlation if said backward correlation has a normalized correlation peak exceeding a given threshold, each selected backward or forward normalized correlations hereinafter referred to as a 'selected correlation'.

8. A method according to claim 7, wherein for each candidate position the associated selected correlation is auto-correlated to produce an autocorrelation.

9. A method according to claim 8, wherein for each candidate position, candidate peaks are selected from the associated selected correlation and autocorrelation, deselecting any candidate peak from the autocorrelation that has a position value within a given threshold of the position value of any candidate peak from the selected correlation, the remaining selected candidate peaks each being indicative of a pitch or harmonic thereof at the associated candidate position.

10. A method according to claim 9, wherein for each candidate position, a cubic B-spline is fitted to each of the associated selected correlation and autocorrelation, the resulting cubic B-splines being used to estimate non integer candidate peak positions by a maxima estimating means, the resulting non-integer candidate peak positions each being indicative of a non-integer pitch or harmonic thereof at the associated candidate position.

11. A method according to claim 10, wherein for each candidate position any associated candidate peak positions are grouped by harmonic relationship to one another, each resulting group representing a candidate pitch, the reliability of the estimated value of said candidate pitch being defined as a function of the number of members in said resulting group.

12. A method according to claim 11, wherein the function used to define the reliability of a candidate pitch is the average of the power of the peaks in the group representing the candidate pitch, multiplied by (member size of said group) / (member size of said group + 2).

13. A method according to any one of claims 2-12, wherein for each of the current and previous frame, having been stored, the candidate pitch with the highest reliability over the total of the plurality of candidate positions is selected, said candidate pitch and associated candidate position defining a reference point.

14. A method according to claim 13, wherein if the respective pitch values of the aforesaid reference points differ in excess of a given threshold and both have a reliability exceeding a given threshold, a pitch transition flag is set on.

15. A method according to any one of claims 13-14, wherein if the respective pitch values of the aforesaid reference points differ in excess of a given threshold and one has a reliability that does not exceed a given threshold, that one is removed unless an alternative candidate pitch from the same candidate position has a sufficiently close pitch value to the other reference point.

16. A method according to any one of claims 13-15, wherein for each candidate position not selected as a reference point, a comparison dependent upon both pitch and reliability to select pitch values that form a smooth pitch track is made between each candidate pitch and one of the reference points, for each candidate position the closest reference point being chosen for comparative purposes unless the candidate position is between two reference points, in which case the earlier reference point is used for positions prior to that candidate position for which a more reliable pitch exists that is closer in value to the later reference point.

17. A method according to claim 16, wherein if no candidate pitch value for a given candidate position could contribute to a smooth pitch track then the candidate position is removed.

18. A method according to any one of claims 16-17, wherein if the transition flag is on, the number of pitch candidates over the totality of candidate positions that are similar to each reference point is counted. If a reference point has no similar pitch candidates, it is removed and the method of claims 16-17 is repeated for the remaining reference point only.

19. A method according to any one of claims 7-18 wherein candidate positions and where defined, reference points are shifted in time by half the pitch period predicted by their corresponding pitch estimate, the shift being forward if backward correlations were selected in the method of claim 7 for that candidate position, or backward if forward correlations were selected in the method of claim 7 for that candidate position.

20. A method according to any one of the preceding claims, wherein a polynomial is applied over all the candidate positions to fit to their pitch estimates and a sample-by- sample pitch track is obtained by interpolation of the polynomial fit.

21. A method according to claim 20, wherein the polynomial is a Hermite polynomial.

22. A method according to any one of the preceding claims, wherein the estimated pitch track is input to a prototype waveform interpolation or a Kalman filter method.

23. An apparatus for sample-by-sample pitch estimation within a speech signal, comprising

A) Selection means to select sample points as candidate positions (302);

B) Estimation means to determine candidate pitches at each of the plurality of said candidate positions (304);

C) Refinement means to generate sub-integer pitch estimates of said candidate pitches (306);

D) Selection means to determine a candidate pitch at each of the plurality of said candidate positions (310); and

E) Interpolation means to produce a smooth pitch track from selected pitches at each of the plurality of said candidate positions (312),

characterized by; means for applying linear predictive coding inverse filtration to a frame of the speech signal and a subsequent selection means to select peaks in said filtered signal as sample points.

24. An apparatus for sample-by-sample pitch estimation within a speech signal according to claim 23, wherein the selection means for candidate pitches is further characterized by; means for applying a reliability indication to each candidate pitch at each candidate position dependent upon the number of harmonics of said pitch identified by pitch estimation means at said candidate position.

25. An apparatus for sample-by-sample pitch estimation within a speech signal according to any of claims 23-24, wherein said apparatus may be operably coupled to prototype waveform interpolation based means or a Kalman filter.

26. A method of sample-by-sample pitch estimation within a speech signal substantially as hereinbefore described with reference to, and/or as illustrated by, the accompanying drawings.