CA1243122A - Processing of acoustic waveforms - Google Patents

Processing of acoustic waveforms

Info

Publication number
CA1243122A
CA1243122A (application CA000504354A)
Authority
CA
Canada
Prior art keywords
frequency
frame
waveform
components
series
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000504354A
Other languages
French (fr)
Inventor
Robert J. McAulay
Thomas F. Quatieri, Jr.
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Massachusetts Institute of Technology
Original Assignee
Massachusetts Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Massachusetts Institute of Technology filed Critical Massachusetts Institute of Technology
Application granted granted Critical
Publication of CA1243122A publication Critical patent/CA1243122A/en
Expired legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Surgical Instruments (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

Abstract of the Disclosure

A sinusoidal model for acoustic waveforms is applied to develop a new analysis/synthesis technique which characterizes a waveform by the amplitudes, frequencies, and phases of component sine waves.
These parameters are estimated from a short-time Fourier transform. Rapid changes in the highly-resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. The component values are interpolated from one frame to the next to yield a representation that is applied to a sine wave generator. The resulting synthetic waveform preserves the general waveform shape and is perceptually indistinguishable from the original. Furthermore, in the presence of noise the perceptual characteristics of the waveform as well as the noise are maintained. The method and devices disclosed herein are particularly useful in speech coding, time-scale modification, frequency scale modification and pitch modification.

Description

The U.S. Government has rights in this invention pursuant to the Department of the Air Force Contract No. F19-028-80-C-0002.

Technical Field

The field of this invention is speech technology generally and, in particular, methods and devices for analyzing, digitally-encoding, modifying and synthesizing speech or other acoustic waveforms.

Background of the Invention

Typically, the problem of representing speech signals is approached by using a speech production model in which speech is viewed as the result of passing a glottal excitation waveform through a time-varying linear filter that models the resonant characteristics of the vocal tract. In many speech applications it suffices to assume that the glottal excitation can be in one of two possible states corresponding to voiced or unvoiced speech.
In the voiced speech state the excitation is periodic with a period which is allowed to vary slowly over time relative to the analysis frame rate (typically 10-20 msecs). For the unvoiced speech state the glottal excitation is modeled as random noise with a flat spectrum. In both cases the power level in the excitation is also considered to be slowly time-varying.


While this binary model has been used successfully to design narrowband vocoders and speech synthesis systems, its limitations are well known.
For example, often the excitation is mixed, having both voiced and unvoiced components simultaneously, and often only portions of the spectrum are truly harmonic. Furthermore, the binary model requires that each frame of data be classified as either voiced or unvoiced, a decision which is particularly difficult to make if the speech is also subject to additive acoustic noise.
Speech coders at rates compatible with conventional transmission lines (i.e. 2.4 - 9.6 kilobits per second) would meet a substantial need.
At such rates the binary model is ill-suited for coding applications. Additionally, speech processing devices and methods that allow the user to modify various parameters in reconstructing the waveform would find substantial usage. For example, time-scale modification (without pitch alteration) would be a very useful feature for a variety of speech applications (i.e. slowing down speech for translation purposes or speeding it up for scanning purposes) as well as for musical composition or analysis. Unfortunately, time-scale (and other parameter) modifications also are not accomplished with high quality by devices employing the binary model.
Thus, there exists a need for better methods and devices for processing audible waveforms. In particular, speech coders operable at mid-band rates and in noisy environments as well as synthesizers capable of maintaining the perceptual quality of speech while changing the rate of articulation would satisfy long-felt needs and provide substantial contributions to the art.

Summary of the Invention

It has been discovered that speech analysis and synthesis as well as coding and time-scale modification can be accomplished simply and effectively by employing a time-frequency representation of the speech waveform which is independent of the speech state. Specifically, a sinusoidal model for the speech waveform is used to develop a new analysis-synthesis technique.
The basic method of the invention includes the steps of: (a) selecting frames (i.e. windows of about 20 - 40 milliseconds) of samples from the waveform; (b) analyzing each frame of samples to extract a set of frequency components; (c) tracking the components from one frame to the next; and (d) interpolating the values of the components from one frame to the next to obtain a parametric representation of the waveform. A synthetic waveform can then be constructed by generating a series of sine waves corresponding to the parametric representation.
In one simple embodiment of the invention, a device is disclosed which uses only the amplitudes and frequencies of the component sine waves to represent the waveform. In this so-called "magnitude-only" system, phase continuity is maintained by defining the phase to be the integral of the instantaneous frequency. In a more comprehensive embodiment, explicit use is made of the measured phases as well as the amplitudes and frequencies of the components.

The invention is particularly useful in speech coding and time-scale modification and has been demonstrated successfully in both of these applications. Robust devices can be built according to the invention to operate in environments of additive acoustic noise. The invention also can be used to analyze single and multiple speaker signals,
music or even biological sounds. The invention will also find particular applications, for example, in reading machines for the blind, in broadcast journalism editing and in transmission of music to remote players.
In one illustrated embodiment of the invention, the basic method summarized above is employed to choose amplitudes, frequencies, and phases corresponding to the largest peaks in a periodogram of the measured signal, independently of the speech state. In order to reconstruct the speech waveform, the amplitudes, frequencies, and phases of the sine waves estimated on one frame are matched and allowed to continuously evolve into the corresponding parameter set on the successive frame. Because the number of estimated peaks is not constant and slowly varying, the matching process is not straightforward.
Rapidly varying regions of speech such as unvoiced/voiced transitions can result in large changes in both the location and number of peaks. To account for such rapid movements in spectral energy, the concept of "birth" and "death" of sinusoidal components is employed in a nearest-neighbor matching method based on the frequencies estimated on each frame. If a new peak appears, a "birth" is said to occur and a new track is initiated. If an old peak is not matched, a "death" is said to occur and the corresponding track is allowed to decay to zero.
Once the parameters on successive frames have been matched, phase continuity of each sinusoidal component is ensured by unwrapping the phase. In one preferred embodiment the phase is unwrapped using a cubic phase interpolation function having parameter values that are chosen to satisfy the measured phase and frequency constraints at the frame boundaries while maintaining maximal smoothness over the frame duration. Finally, the corresponding sinusoidal amplitudes are simply interpolated in a linear manner across each frame.
In speech coding applications, pitch estimates are used to establish a set of harmonic frequency bins to which the frequency components are assigned. (Pitch is used herein to mean the fundamental rate at which a speaker's vocal cords are vibrating.) The amplitudes of the components can be coded directly using adaptive pulse code modulation (ADPCM) across frequency or indirectly using linear predictive coding. In each harmonic frequency bin the peak having the largest amplitude is selected and assigned to the frequency at the center of the bin.
This results in a harmonic series based upon the coded pitch period. The phases can then be coded by using the frequencies to predict phase at the end of the frame, unwrapping the measured phase with respect to this prediction and then coding the phase residual using 4 bits per phase peak. If there are not enough bits available to code all of the phase peaks (e.g.
for low-pitched speakers), phase tracks for the high frequency peaks can be artificially generated. In one preferred embodiment, this is done by translating the frequency tracks of the baseband peaks to the high frequency of the uncoded phase peaks. This new coding scheme has the important property of adaptively allocating the bits for each speaker and hence is self-tuning to both low- and high-pitched speakers. Although pitch is used to provide side information for the coding algorithm, the standard voice-excitation model for speech is not used. This means that recourse is never made to a voiced-unvoiced decision. As a consequence the invention is robust in noise and can be applied at various data transmission rates simply by changing the rules for the bit allocation.
The invention is also well-suited for time-scale modification, which is accomplished by time-scaling the amplitudes and phases such that the frequency variations are preserved. The time-scale at which the speech is played back is controlled simply by changing the rate at which the matched peaks are interpolated. This means that the time-scale can be speeded up or slowed down by any factor and this factor can be time-varying. This rate can be controlled by a panel knob which allows an operator complete flexibility for varying the time-scale. There is no perceptual delay in performing the time-scaling.
The invention will next be described in connection with certain illustrated embodiments.
However, it should be clear that various changes and modifications can be made by those skilled in the art without departing from the spirit and scope of the invention. For example, other sampling techniques can be substituted for the use of a variable frame length and Hamming window. Moreover, the length of such frames and windows can vary in response to the particular application. Likewise, frequency matching can be accomplished by various means. A variety of commercial devices are available to perform Fourier analysis; such analysis can also be performed by custom hardware or specially-designed programs.
Various techniques for extracting pitch information can be employed. For example, the pitch period can be derived from the Fourier transform.
Other techniques such as the Gold-Malpass techniques can also be used. See generally, M.L. Malpass, "The Gold Pitch Detector in a Real Time Environment", Proc.
of EASCON 1975 (Sept. 1975); B. Gold, "Description of a Computer Program for Pitch Detection", Fourth International Congress on Acoustics, Copenhagen, August 21-28, 1962; and B. Gold, "Note on Buzz-Hiss Detection", J. Acoust. Soc. Amer. 36, 1659-1661 (1964).
Various coding techniques can also be used interchangeably with those described below. Channel encoding techniques are described in J.N. Holmes, "The JSRU Channel Vocoder", IEE Proc., 127, 53-60 (1980). Adaptive pulse code modulation is described in L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals (Prentice-Hall, 1978). Linear predictive coding is described by J.D. Markel, Linear Prediction of Speech (Springer-Verlag, 1976).

It should be appreciated that the term "interpolation" is used broadly in this application to encompass various techniques for filling in data values between those measured at the frame boundaries. In the magnitude-only system linear interpolation is employed to fill in amplitude and frequency values. In this simple system phase values are obtained by first defining a series of instantaneous frequency values by interpolating matched frequency components from one frame to the next and then integrating the series of instantaneous frequency values to obtain a series of interpolated phase values. In the more comprehensive system the phase value of each frame is derived directly and a cubic polynomial equation preferably is employed to obtain maximally smooth phase interpolations from frame to frame.
Other techniques that accomplish the same purpose are also referred to in this application as interpolation techniques. For example, the so-called "overlap and add" method of filling in data values can also be used. In this method a weighted overlapping function can be applied to the resulting sine waves generated during each frame and then the overlapped values can be summed to fill in the values between those measured at the frame boundaries.

Brief Description of the Drawings

FIGURE 1 is a schematic block diagram of one embodiment of the invention in which only the magnitudes and frequencies of the components are used to reconstruct a sampled waveform.

FIGURE 2 is an illustration of the extracted amplitude and frequency components of a waveform sampled according to the present invention.

FIGURE 3 is a general illustration of the frequency matching method of the present invention.


FIGURE 4 is a detailed schematic illustration of a frequency matching method according to the present invention.

FIGURE 5 is an illustration of tracked frequency components of an exemplary speech pattern.

FIGURE 6 is a schematic block diagram of another embodiment of the invention in which magnitude and phase of frequency components are used to reconstruct a sampled waveform.

FIGURE 7 is an illustrative set of cubic phase interpolation functions for smoothing the phase functions useful in connection with the embodiment of FIGURE 6, from which the "maximally smooth" phase function is selected.

FIGURE 8 is a schematic block diagram of another embodiment of the invention particularly useful for time-scale modification.

FIGURE 9 is a schematic block diagram showing an embodiment of the system estimation function of FIGURE 8.

FIGURE 10 is a block diagram of one real-time implementation of the invention.

Detailed Description

In the present invention the speech waveform is modeled as a sum of sine waves. If s(n) represents the sampled speech waveform then

    s(n) = Σ_i a_i(n) sin[θ_i(n)]    (1)

where a_i(n) and θ_i(n) are the time-varying amplitudes and phases of the i'th tone.
In a simple embodiment the phase can be defined to be the integral of the instantaneous frequency f_i(n) and therefore satisfies the recursion

    θ_i(n) = θ_i(n-1) + 2π f_i(n)/f_s    (2)

where f_s is the sampling frequency. If the tones are harmonically related, then

    f_i(n) = i · f_0(n)    (3)

where f_0(n) represents the fundamental frequency at time n. One particularly attractive property of the above model is the fact that phase continuity, hence waveform continuity, is guaranteed as a consequence of the definition of phase in terms of the instantaneous frequency. This means that waveform reconstruction is possible from the magnitude-only spectrum since a high-resolution spectral analysis reveals the amplitudes and frequencies of the component sine waves.
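The phase recursion (2) can be sketched in a few lines. The sketch below is illustrative only; the function name and interface are assumptions, and the frequency trajectory is supplied in Hz:

```python
import math

def synthesize_tone(freqs_hz, amp, fs):
    """Generate one sinusoidal component whose phase is the running
    integral of its (possibly time-varying) instantaneous frequency,
    as in equation (2): theta(n) = theta(n-1) + 2*pi*f(n)/fs."""
    theta = 0.0
    out = []
    for f in freqs_hz:
        theta += 2.0 * math.pi * f / fs   # phase recursion guarantees continuity
        out.append(amp * math.sin(theta))
    return out

# A tone whose frequency glides from 100 Hz to 200 Hz: no phase (hence no
# waveform) discontinuity occurs even though the frequency changes at
# every sample.
fs = 10000
glide = [100.0 + 100.0 * n / 9999 for n in range(10000)]
samples = synthesize_tone(glide, 1.0, fs)
```

Because the phase is accumulated rather than recomputed per sample, consecutive samples can never jump by more than 2πf/f_s in phase, which is the waveform-continuity property noted above.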
A block diagram of an analysis/synthesis system according to the invention is illustrated in FIGURE 1. The peaks of the magnitude of the discrete Fourier transform (DFT) of a windowed waveform are found simply by determining the locations of a change in slope (concave down). In addition, the total number of peaks can be limited and this limit can be adapted to the expected average pitch of the speaker.
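The slope-change peak-picking rule can be illustrated as follows. This is a minimal sketch; the function name and the particular peak-limiting strategy (keep the largest peaks) are illustrative assumptions, not the patented implementation:

```python
def pick_peaks(magnitude, max_peaks=None):
    """Locate spectral peaks as points where the slope of the DFT
    magnitude changes from positive to negative (concave down), then
    optionally keep only the largest ones."""
    peaks = [k for k in range(1, len(magnitude) - 1)
             if magnitude[k - 1] < magnitude[k] >= magnitude[k + 1]]
    if max_peaks is not None:
        # Keep the max_peaks largest peaks, restoring frequency order.
        peaks = sorted(peaks, key=lambda k: magnitude[k], reverse=True)[:max_peaks]
        peaks.sort()
    return peaks

spectrum = [0, 1, 3, 2, 1, 4, 9, 4, 2, 5, 3, 0]
print(pick_peaks(spectrum))               # → [2, 6, 9]
print(pick_peaks(spectrum, max_peaks=2))  # → [6, 9]
```

In practice the limit `max_peaks` would be adapted to the expected average pitch of the speaker, as described above.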
In a simple embodiment the speech waveform can be digitized at a 10 kHz sampling rate, low-pass filtered at 5 kHz, and analyzed at 20 msec frame intervals with a 20 msec Hamming window. Speech representations according to the invention can also be obtained by employing an analysis window of variable duration. For some applications it is preferable to have the width of the analysis window be pitch adaptive, being set, for example, at 2.5 times the average pitch period with a minimum width of 20 msec.
Plotted in FIGURE 2 is a typical periodogram for a frame of speech along with the amplitudes and frequencies that are estimated using the above procedure. The DFT was computed using a 512-point fast Fourier transform (FFT). Different sets of these parameters will be obtained for each analysis frame. To obtain a representation of the waveform over time, frequency components measured on one frame must be matched with those that are obtained on a successive frame.
FIGURE 3 illustrates the basic process of frequency component matching. If the number of peaks were constant and slowly varying from frame to frame, the problem of matching the parameters estimated on one frame with those on a successive frame would simply require a frequency ordered assignment of peaks. In practice, however, there will be spurious peaks that come and go due to the effects of sidelobe interaction; the locations of the peaks will change as the pitch changes; and there will be rapid changes in both the location and the number of peaks corresponding to rapidly-varying regions of speech, such as at voiced/unvoiced transitions. In order to account for such rapid movements in the spectral peaks, the present invention employs the concept of "birth" and "death" of sinusoidal components as part of the matching process.

The matching process is further explained by consideration of FIGURE 4. Assume that peaks up to frame k have been matched and a new parameter set for frame k+1 is generated. Let the chosen frequencies on frames k and k+1 be denoted by ω_0^k, ω_1^k, ..., ω_{N-1}^k and ω_0^{k+1}, ω_1^{k+1}, ..., ω_{M-1}^{k+1} respectively, where N and M represent the total number of peaks selected on each frame (N ≠ M in general). One process of matching each frequency in frame k, ω_n^k, to some frequency in frame k+1, ω_m^{k+1}, is given in the following three steps.
Step 1:
Suppose that a match has been found for frequencies ω_0^k, ω_1^k, ..., ω_{n-1}^k. A match is now attempted for frequency ω_n^k. FIGURE 4(a) depicts the case where all frequencies ω_m^{k+1} in frame k+1 lie outside a "matching interval" Δ of ω_n^k, i.e.,

    |ω_n^k - ω_m^{k+1}| ≥ Δ    (4)

for all m. In this case the frequency track associated with ω_n^k is declared "dead" on entering frame k+1, and ω_n^k is matched to itself in frame k+1, but with zero amplitude. Frequency ω_n^k is then eliminated from further consideration and Step 1 is repeated for the next frequency in the list, ω_{n+1}^k.

If on the other hand there exists a frequency ω_m^{k+1} in frame k+1 that lies within the matching interval about ω_n^k, and is the closest such frequency, i.e.,

    |ω_n^k - ω_m^{k+1}| < |ω_n^k - ω_i^{k+1}| < Δ    (5)

for all i ≠ m, then ω_m^{k+1} is declared to be a candidate match to ω_n^k. A definitive match is not yet made, since there may exist a better match in frame k to the frequency ω_m^{k+1}, a contingency which is accounted for in Step 2.
Step 2:
In this step, a candidate match from Step 1 is confirmed. Suppose that a frequency ω_n^k of frame k has been tentatively matched to frequency ω_m^{k+1} of frame k+1. Then, if ω_m^{k+1} has no better match to the remaining unmatched frequencies of frame k, the candidate match is declared to be a definitive match. This condition, illustrated in FIGURE 4(c), is given by

    |ω_m^{k+1} - ω_n^k| < |ω_m^{k+1} - ω_i^k|    for i > n    (6)

When this occurs, frequencies ω_n^k and ω_m^{k+1} are eliminated from further consideration and Step 1 is repeated for the next frequency in the list, ω_{n+1}^k.
If the condition (6) is not satisfied, then the frequency ω_m^{k+1} in frame k+1 is better matched to the frequency ω_{n+1}^k in frame k than it is to the test frequency ω_n^k. Two additional cases are then considered. In the first case, illustrated in FIGURE 4(d), the adjacent remaining lower frequency ω_{m-1}^{k+1} (if one exists) lies below the matching interval, hence no match can be made. As a result, the frequency track associated with ω_n^k is declared "dead" on entering frame k+1, and ω_n^k is matched to itself with zero amplitude. In the second case, illustrated in FIGURE 4(e), the frequency ω_{m-1}^{k+1} is within the matching interval about ω_n^k and a definitive match is made. After either case Step 1 is repeated using the next frequency in the frame k list, ω_{n+1}^k. It should be noted that many other situations are possible in this step, but to keep the tracker alternatives as simple as possible only these two cases are discussed.
Step 3:
When all frequencies of frame k have been tested and assigned to continuing tracks or to dying tracks, there may remain frequencies in frame k+1 for which no matches have been made. Suppose that ω_m^{k+1} is one such frequency; then it is concluded that ω_m^{k+1} was "born" in frame k and its match, a new frequency ω_m^k, is created in frame k with zero magnitude. This is done for all such unmatched frequencies. This last step is illustrated in FIGURE
4(f).
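A simplified sketch of the birth/death tracker follows. It implements only a greedy nearest-neighbour assignment within the matching interval Δ, omitting the Step 2 back-matching contingencies of FIGURES 4(c)-(e); the function name, return convention, and index-based bookkeeping are illustrative assumptions:

```python
def match_tracks(prev, curr, delta):
    """Greedily match frame-k frequencies (prev) to frame-(k+1)
    frequencies (curr).  An old peak with no neighbour within delta
    'dies' (it would be matched to itself at zero amplitude); a new
    peak left unmatched is 'born' (a zero-amplitude partner would be
    created in frame k).  Returns (continued, deaths, births) where
    continued holds (prev_index, curr_index) pairs and deaths/births
    hold indices into prev/curr respectively."""
    unmatched = set(range(len(curr)))
    continued, deaths = [], []
    for n, f in enumerate(prev):
        candidates = [m for m in unmatched if abs(curr[m] - f) < delta]
        if not candidates:
            deaths.append(n)                      # track dies entering k+1
        else:
            m = min(candidates, key=lambda m: abs(curr[m] - f))
            continued.append((n, m))
            unmatched.remove(m)
    births = sorted(unmatched)                    # new tracks born
    return continued, deaths, births

prev = [100.0, 300.0, 500.0]
curr = [105.0, 310.0, 700.0]
print(match_tracks(prev, curr, delta=50.0))
# → ([(0, 0), (1, 1)], [2], [2]): two tracks continue, the 500 Hz
#   track dies, and a new track is born at 700 Hz.
```

The zero-amplitude "self-matches" for deaths and births are what let every track fade in and out smoothly during the interpolation stage.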
The results of applying the tracker to a segment of real speech are shown in FIGURE 5, which demonstrates the ability of the tracker to adapt quickly through transitory speech behavior such as voiced/unvoiced transitions, and mixed voiced/unvoiced regions.
In the simple magnitude-only system, synthesis is accomplished in a straightforward manner. Each pair of matched frequencies (and their corresponding magnitudes) is linearly interpolated across consecutive frame boundaries. As noted above, in the magnitude-only system, phase continuity is guaranteed by the definition of phase in terms of the instantaneous frequency. The interpolated values are then used to drive a sine wave generator which yields the synthetic waveform as shown in FIGURE 1. It should be noted that performance is improved by reducing the matching interval, Δ, at higher frequencies.
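The magnitude-only synthesis of one matched track across one frame might be sketched as follows; the interface (per-frame endpoint values, carried phase) is an illustrative assumption:

```python
import math

def synth_frame(a0, a1, f0, f1, theta0, N, fs):
    """Synthesize one matched track across one frame of N samples.
    Amplitude and frequency are linearly interpolated between the
    frame-boundary values, and the phase is the running integral of
    the instantaneous frequency, so the waveform is continuous across
    the frame boundary.  Returns the samples and the phase to carry
    into the next frame."""
    out = []
    theta = theta0
    for n in range(N):
        t = n / N
        a = a0 + (a1 - a0) * t          # linear amplitude interpolation
        f = f0 + (f1 - f0) * t          # linear frequency interpolation
        theta += 2.0 * math.pi * f / fs # phase = integral of frequency
        out.append(a * math.sin(theta))
    return out, theta

# Two consecutive frames of one track: a 200 Hz component fades in,
# then glides toward 220 Hz; the carried phase keeps the seam smooth.
fs, N = 8000, 160
frame1, th = synth_frame(0.0, 1.0, 200.0, 200.0, 0.0, N, fs)
frame2, _ = synth_frame(1.0, 1.0, 200.0, 220.0, th, N, fs)
```

A full synthesizer would run one such generator per matched track and sum the outputs sample by sample.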

A further feature shown in FIGURE 1 (and discussed in detail below) is that the present invention is ideally suited for performing time-scale modification. From FIGURE 3 it can be seen that by simply expanding or compressing the time scale, the locations and magnitudes are preserved while modifying their rate of change in time. To effect a rate of change b, the synthesizer interpolation rate R' (see FIGURE 1) is given by R' = bR. Furthermore, with this system it is straightforward to invoke a time-varying rate of change since frequencies may be stretched or compressed by varying the interpolation rate in time.
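The effect of changing the interpolation rate (R' = bR) can be illustrated with a toy single-track generator. The sketch assumes a constant frequency within the frame and is not the patented implementation; only the principle matters: the frame is played over b·N samples, but each sample still advances the phase by 2πf/f_s, so pitch is unchanged while the articulation rate changes:

```python
import math

def stretch_frame(amp, freq_hz, theta0, N, b, fs):
    """Time-scale one synthesis frame by factor b: emit round(b*N)
    samples, each advancing the phase by 2*pi*f/fs.  b > 1 slows the
    playback, b < 1 speeds it up; the frequency (pitch) is preserved."""
    n_out = round(b * N)
    theta, out = theta0, []
    for _ in range(n_out):
        theta += 2.0 * math.pi * freq_hz / fs
        out.append(amp * math.sin(theta))
    return out, theta

fs, N = 8000, 160
slow, _ = stretch_frame(1.0, 203.0, 0.0, N, b=2.0, fs=fs)  # half speed
fast, _ = stretch_frame(1.0, 203.0, 0.0, N, b=0.5, fs=fs)  # double speed
```

Both outputs oscillate at 203 Hz; only their durations differ, and since b multiplies only the sample count it can be varied from frame to frame for a time-varying rate.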
FIGURE 6 shows a block diagram of a more comprehensive system in which phases are measured directly. In this system the frequency components and their amplitudes are determined in the same manner as the magnitude-only system described above and illustrated in FIGURE 1. Phase measurements, however, are derived directly from the discrete Fourier transform by computing the arctangents at the estimated frequency peaks.
Since in the comprehensive system of FIGURE 6 a set of amplitudes, frequencies and phases are estimated for each frame, it might seem reasonable to estimate the original speech waveform on the k'th frame by generating synthetic speech using the equation

    s(n) = Σ_ℓ A_ℓ^k cos[n ω_ℓ^k + θ_ℓ^k]    (7)

for kN ≤ n < (k+1)N. Due to the time-varying nature of the parameters, however, this straightforward approach leads to discontinuities at the frame boundaries which seriously degrade the quality of the synthetic speech. Therefore, a method must be found for smoothly interpolating the parameters measured from one frame to those that are obtained on the next.
As a result of the frequency matching algorithm described in the previous section, all of the parameters measured for an arbitrary frame k are associated with a corresponding set of parameters for frame k+1. Letting [A_ℓ^k, ω_ℓ^k, θ_ℓ^k] and [A_ℓ^{k+1}, ω_ℓ^{k+1}, θ_ℓ^{k+1}] denote the successive sets of parameters for the ℓ'th frequency track, then an obvious solution to the amplitude interpolation problem is to take

    A(n) = A^k + (A^{k+1} - A^k) n/N    (8)

where n = 1, 2, ..., N is the time sample into the k'th frame (the track subscript "ℓ" has been omitted for convenience).
Unfortunately such a simple approach cannot be used to interpolate the frequency and phase because the measured phase, θ, is obtained modulo 2π. Hence, phase unwrapping must be performed to insure that the frequency tracks are "maximally smooth" across frame boundaries. The first step in solving this problem is to postulate a phase interpolation function that is a cubic polynomial, namely

    θ(t) = ζ + γt + αt^2 + βt^3    (9)

It is convenient to treat the phase function as though it were a function of a continuous time variable t, with t = 0 corresponding to frame k and t = T corresponding to frame k+1.
The parameters of the polynomial must be chosen to satisfy frequency and phase measurements obtained at the frame boundaries. Since the instantaneous frequency is the derivative of the phase, then

    θ'(t) = γ + 2αt + 3βt^2    (10)

and it follows that at the starting point, t = 0,

    θ(0) = ζ = θ^k
    θ'(0) = γ = ω^k    (11)

and at the terminal point, t = T,

    θ(T) = θ^k + ω^k T + αT^2 + βT^3 = θ^{k+1} + 2πM
    θ'(T) = ω^k + 2αT + 3βT^2 = ω^{k+1}    (12)

where again the track subscript "ℓ" is omitted for convenience.
Since the terminal phase θ^{k+1} is measured modulo 2π, it is necessary to augment it by the term 2πM (M is an integer) in order to make the resulting frequency function "maximally smooth". At this point M is unknown, but for each value of M, whatever it may be, (12) can be solved for α(M) and β(M) (the dependence on M has now been made explicit). The solution is easily shown to satisfy the matrix equation:

    [α(M)]   [  3/T^2   -1/T  ] [ θ^{k+1} - θ^k - ω^k T + 2πM ]
    [β(M)] = [ -2/T^3   1/T^2 ] [ ω^{k+1} - ω^k               ]    (13)

In order to determine M and ultimately the solution to the phase unwrapping problem, an additional constraint needs to be imposed that quantifies the "maximally smooth" criterion. FIGURE 7 illustrates a typical set of cubic phase interpolation functions for a number of values of M.
It seems clear on intuitive grounds that the best phase function to pick is the one that would have the least variation. This is what is meant by a maximally smooth frequency track. In fact, if the frequencies were constant and the vocal tract were stationary, the true phase would be linear.
Therefore a reasonable criterion for "smoothness" is to choose M such that

    f(M) = ∫_0^T [θ''(t;M)]^2 dt    (14)

is a minimum, where θ''(t;M) denotes the second derivative of θ(t;M) with respect to the time variable t. Although M is integer valued, since f(M) is quadratic in M, the problem is most easily solved by minimizing f(x) with respect to the continuous variable x and then choosing M to be the integer closest to x. After straightforward but tedious algebra, it can be shown that the minimizing value of x is

    x* = (1/2π)[(θ^k + ω^k T - θ^{k+1}) + (ω^{k+1} - ω^k)T/2]    (15)

from which M* is determined and used in (13) to compute α(M*) and β(M*), and in turn, the unwrapped phase interpolation function

    θ(t) = θ^k + ω^k t + α(M*)t^2 + β(M*)t^3    (16)

This phase function not only satisfies all of the measured phase and frequency endpoint constraints, but also unwraps the phase in such a way that θ(t) is maximally smooth.
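Equations (13), (15) and (16) can be exercised numerically as follows. The sketch assumes frequencies expressed in radians per sample and a frame length T in samples; the function name and interface are illustrative:

```python
import math

def cubic_phase(theta_k, omega_k, theta_k1, omega_k1, T):
    """Cubic phase interpolation with 'maximally smooth' unwrapping:
    pick M as the integer nearest the closed-form minimizer x* of
    eq. (15), then solve eq. (13) for the cubic coefficients.
    Returns the unwrapped phase function theta(t) of eq. (16) on [0, T]."""
    x = (1.0 / (2.0 * math.pi)) * ((theta_k + omega_k * T - theta_k1)
                                   + (omega_k1 - omega_k) * T / 2.0)
    M = round(x)
    b1 = theta_k1 + 2.0 * math.pi * M - theta_k - omega_k * T
    b2 = omega_k1 - omega_k
    alpha = (3.0 / T**2) * b1 - (1.0 / T) * b2   # first row of (13)
    beta = (-2.0 / T**3) * b1 + (1.0 / T**2) * b2  # second row of (13)
    return lambda t: theta_k + omega_k * t + alpha * t**2 + beta * t**3

# One track over a frame of T = 100 samples: start phase 0.3 rad at
# 0.15 rad/sample, ending at measured phase 1.1 rad and 0.16 rad/sample.
theta = cubic_phase(0.3, 0.15, 1.1, 0.16, 100.0)
```

By construction θ(0) equals the starting phase, θ(T) equals the measured terminal phase plus 2πM*, and the endpoint slopes match the measured frequencies, which is exactly the constraint set (11)-(12).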
Since the above analysis began with the assumption of an initial unwrapped phase θ^k corresponding to frequency ω^k at the start of frame k, it is necessary to specify the initialization of the frame interpolation procedure. This is done by noting that at some point in time the track under study was born. When this event occurred, an amplitude, frequency and phase were measured at frame k+1 and the parameters at frame k to which these measurements correspond were defined by setting the amplitude to zero (i.e., A^k = 0) while maintaining the same frequency (i.e., ω^k = ω^{k+1}). In order to insure that the phase interpolation constraints are satisfied initially, the unwrapped phase is defined to be the measured phase θ^{k+1} and the start-up phase is defined to be

    θ^k = θ^{k+1} - ω^{k+1} N    (17)

where N is the number of samples traversed in going from frame k+1 back to frame k.
As a result of the above phase unwrapping procedure, each frequency track will have associated with it an instantaneous unwrapped phase which accounts for both the rapid phase changes due to the frequency of each sinusoidal component, and the slowly varying phase changes due to the glottal pulse and the vocal tract transfer function. Letting θ_ℓ(t) denote the unwrapped phase function for the ℓ'th track, then the final synthetic waveform will be given by

    s(n) = Σ_{ℓ=1}^{L(k)} A_ℓ(n) cos[θ_ℓ(n)]    (18)

where kN ≤ n < (k+1)N, A_ℓ(n) is given by (8), θ_ℓ(n) is the sampled data version of (16), and L(k) is the number of sine waves estimated for the k'th frame.
The invention as described in connection with FIGURE 6 has been used to develop a speech coding system for operation at 8 kilobits per second. At this rate, high-quality speech depends critically on the phase measurements and, thus, phase coding is a high priority. Since the sinusoidal representation also requires the specification of the amplitudes and frequencies, it is clear that relatively few peaks can be coded before all of the available bits are used. The first step, therefore, is to significantly reduce the number of parameters that must be coded. One way to do this is to force all of the frequencies to be harmonic.
During voiced speech one would expect all of the peaks to be harmonically related; therefore, by coding the fundamental, the locations of all of the frequencies will be available at the receiver.
During unvoiced speech the frequency locations of the peaks will not be harmonic. However, it is well known from random process theory that noise-like waveforms can be represented (in an ensemble mean-squared error sense) in terms of a harmonic expansion of sine waves, provided the spacing between adjacent harmonics is small enough that there is little change in the power spectrum envelope (i.e., intervals less than about 100 Hz). This representation preserves the statistical properties of the input speech provided the amplitudes and phases vary randomly from frame to frame.
Since the amplitudes and phases are to be coded, this random variation inherent in the measurement variables can be preserved in the synthetic waveform.
As a practical matter it is preferable to estimate the fundamental frequency that characterizes the set of frequencies in each frame, which in turn relates to pitch extraction. For example, pitch extraction can be accomplished by selecting the fundamental frequency of a harmonic set of sine waves to produce the best fit to the input waveform according to a perceptual criterion. Other pitch extraction techniques can also be employed.
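As one hypothetical stand-in for such a pitch extractor (much cruder than the perceptual criterion mentioned above), a candidate fundamental can be chosen to minimize the amplitude-weighted distance of the measured peak frequencies from their nearest harmonics. All names and the candidate grid are illustrative assumptions.

```python
import numpy as np

def estimate_f0(peak_freqs, peak_amps, f0_grid):
    """Choose the candidate fundamental whose harmonics lie closest
    (amplitude-weighted, least squares) to the measured peak frequencies.
    The grid must be bounded below, else a tiny f0 fits trivially."""
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    peak_amps = np.asarray(peak_amps, dtype=float)
    best_f0, best_err = None, np.inf
    for f0 in f0_grid:
        harm = np.round(peak_freqs / f0) * f0   # nearest harmonic of f0
        err = np.sum(peak_amps * (peak_freqs - harm) ** 2)
        if err < best_err:
            best_f0, best_err = f0, err
    return best_f0
```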
As an immediate consequence of using the harmonic frequency model, it follows that the number of sine wave components to be coded is the bandwidth of the coded speech divided by the fundamental.
Since there is no guarantee that the number of measured peaks will equal this harmonic number, provision should be made for adjusting the number of peaks to be coded. Based on the fundamental, a set of harmonic frequency bins is established and the number of peaks falling within each bin is examined. If more than one peak is found, then only the amplitude and phase corresponding to the largest peak are retained for coding. If there are no peaks in a given bin, then a fictitious peak is created having an amplitude and phase obtained by sampling the short-time Fourier transform at the frequency corresponding to the center of the bin.
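The bin-adjustment rule above can be sketched as follows. The `stft` callable standing in for the short-time Fourier transform, and all names, are assumptions for illustration.

```python
import numpy as np

def assign_peaks_to_bins(peak_freqs, peak_amps, peak_phases, f0, n_bins, stft):
    """Retain one (amplitude, phase) pair per harmonic bin of width f0:
    the largest measured peak in the bin, or a fictitious peak sampled
    from the short-time Fourier transform at the bin center if the bin
    is empty."""
    coded = []
    for h in range(1, n_bins + 1):
        center = h * f0
        lo, hi = center - f0 / 2, center + f0 / 2
        inside = [i for i, f in enumerate(peak_freqs) if lo <= f < hi]
        if inside:
            i = max(inside, key=lambda i: peak_amps[i])  # keep largest peak
            coded.append((peak_amps[i], peak_phases[i]))
        else:
            x = stft(center)                             # fictitious peak
            coded.append((abs(x), np.angle(x)))
    return coded
```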
The amplitudes are then coded by applying the same techniques used in channel vocoders. That is, a gain level is set, for example, by using 5 bits with 2 dB per level to code the amplitude of a first peak (i.e., the first peak above 300 Hz).
Subsequent peaks are coded logarithmically using delta-modulation techniques across frequency. In one simulation 3.6 kbps were assigned to code the amplitudes at a 50 Hz frame rate. Adaptive bit allocation rules can be used to assign bits to peaks. For example, if the pitch is high there will be relatively few peaks to code, and there will be more bits per peak. Conversely, when the pitch is low there will be relatively few bits per peak; but since the peaks will be closer together their values will be more correlated, hence the ADPCM coder should be able to track them well.
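A toy version of the logarithmic delta coding across frequency might look like the following. The 2 dB gain step follows the text; the 3 dB delta step and the three-level delta code are illustrative assumptions, not the patent's bit allocation.

```python
import numpy as np

def code_amplitudes(amps, gain_bits=5, gain_step_db=2.0,
                    delta_step_db=3.0, delta_levels=(-1, 0, 1)):
    """Delta-modulate peak amplitudes across frequency: the first peak
    sets a quantized gain in dB; each later peak is coded as the nearest
    allowed dB step relative to the previous decoded value."""
    db = 20.0 * np.log10(np.maximum(np.asarray(amps, dtype=float), 1e-6))
    gain_index = int(np.clip(round(db[0] / gain_step_db), 0, 2 ** gain_bits - 1))
    decoded = [gain_index * gain_step_db]
    codes = [gain_index]
    for target in db[1:]:
        diff = target - decoded[-1]
        step = min(delta_levels, key=lambda s: abs(diff - s * delta_step_db))
        codes.append(step)
        decoded.append(decoded[-1] + step * delta_step_db)
    return codes, decoded
```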
To code the phases a fixed number of bits per peak (typically 4 or 5) is used. One method for coding the phases is to assign the measured phase to one of 2^n equal subdivisions of the −π to π region, where n = 4 or 5. Another method uses the frequency track corresponding to the phase (to be coded) to predict the phase at the end of the current frame, unwrap the value, and then code the phase residual using ADPCM techniques with 4 or 5 bits per phase peak. Since there remains only 4.4 kbps to code the phases and the fundamental (7 bits are used), at a 50 Hz frame rate it will be possible to code at most 16 peaks. At a 4 kHz speech bandwidth and four bits per phase, all of the phases will be coded provided the pitch is greater than 250 Hz. If the pitch is less than 250 Hz, provision has to be made for regenerating a phase track for the uncoded high frequency peaks.
This is done by computing a differential frequency that is the difference between the derivative of the instantaneous cubic phase and the linear interpolation of the end point frequencies for that track. The differential frequency is translated to the high frequency region by adding it to the linear interpolation of the end point frequencies corresponding to the track of the uncoded phase. The resulting instantaneous frequency function is then integrated to give the instantaneous phase function that is applied to the sine wave generator. In this way the phase coherence intrinsic in voiced speech and the phase incoherence characteristic of unvoiced speech are effectively translated to the uncoded frequency regions.
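The first phase-coding method, quantizing each measured phase to one of 2^n uniform cells of [−π, π), can be sketched directly; the names and the mid-cell reconstruction rule are illustrative.

```python
import numpy as np

def quantize_phase(phase, bits=4):
    """Map a measured phase into one of 2**bits equal subdivisions of
    [-pi, pi) and return (code index, reconstructed mid-cell value).
    Maximum reconstruction error is half a cell, pi / 2**bits."""
    levels = 2 ** bits
    wrapped = np.angle(np.exp(1j * phase))  # principal value
    index = int(np.floor((wrapped + np.pi) / (2 * np.pi) * levels)) % levels
    value = -np.pi + (index + 0.5) * 2 * np.pi / levels
    return index, value
```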
In FIGURE 8 another embodiment of the invention is shown, particularly adapted for time-scale modification. In this illustration, the representative sine waves are further defined to consist of system contributions (i.e., from the vocal tract) and excitation contributions (i.e., from the vocal cords). The excitation phase contributions are singled out for cubic interpolation. The procedure generally follows that described above in connection with the other embodiments; however, in a further step the measured amplitudes A_ℓ^k and phases θ_ℓ^k are decomposed into vocal tract and excitation components. The approach is to first form estimates of the vocal tract amplitude and phase as functions of frequency at each analysis frame (i.e., M(ω,kR) and Φ(ω,kR)). System amplitude and phase estimates at the selected frequencies ω_ℓ^k are then given by:

M_ℓ^k = M(ω_ℓ^k, kR)    (19)

and

Φ_ℓ^k = Φ(ω_ℓ^k, kR)    (20)

Finally, the excitation parameter estimates at each analysis frame boundary are obtained as

a_ℓ^k = A_ℓ^k / M_ℓ^k    (21)

and

Ω_ℓ^k = θ_ℓ^k − Φ_ℓ^k    (22)

The decomposition problem then becomes that of estimating M(ω,kR) and Φ(ω,kR) as functions of frequency from the high-resolution spectrum X(ω,kR). (In practice, of course, uniformly spaced frequency samples are available from the DFT.) There exist a number of established ways of separating the system magnitude from the high-resolution spectrum, such as all-pole modeling and homomorphic deconvolution. If the vocal tract transfer function is assumed to be minimum phase, then the logarithm of the system magnitude and the system phase form a Hilbert transform pair. Under this condition, a phase estimate Φ(ω,kR) can be derived from the logarithm of a magnitude estimate M(ω,kR) of the system function through the Hilbert transform. Furthermore, the resulting phase estimate will be smooth and unwrapped as a function of frequency.
One approach to estimation of the system magnitude, and the corresponding estimation of the system phase through the use of the Hilbert transform, is shown in FIGURE 9 and is based on a homomorphic transformation. In this technique, the separation of the system amplitude from the high-resolution spectrum and the computation of the Hilbert transform of this amplitude estimate are in effect performed simultaneously. The Fourier transform of the logarithm of the high-resolution magnitude is first computed to obtain the "cepstrum". A right-sided window, with duration proportional to the average pitch period, is then applied. The imaginary component of the resulting inverse Fourier transform is the desired phase and the real part is the smooth log-magnitude. In practice, uniformly spaced samples of the Fourier transform are computed with the FFT.
The length of the FFT was chosen at 512, which was sufficiently large to avoid aliasing in the cepstrum. Thus, the high-resolution spectrum used to estimate the sinewave frequencies is also used to estimate the vocal-tract system function.
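Under the assumptions in the text (minimum-phase system, right-sided cepstral window), the FIGURE 9 procedure can be sketched as below. The quefrency cutoff, which the text makes proportional to the average pitch period, is left to the caller; this is a sketch, not the patent's fixed-point implementation.

```python
import numpy as np

def homomorphic_system_estimate(log_mag, cutoff):
    """Estimate a smooth log-magnitude and a minimum-phase system phase
    from uniformly spaced samples of the log-magnitude spectrum.

    The real cepstrum is kept only over the first `cutoff` quefrency
    samples; doubling the positive quefrencies realizes the Hilbert
    transform relation for a minimum-phase system, so the real part of
    the resulting complex log spectrum is the smooth log-magnitude and
    the imaginary part is the system phase."""
    n = len(log_mag)
    cep = np.fft.ifft(log_mag).real     # real cepstrum of the log-magnitude
    win = np.zeros(n)
    win[0] = 1.0                        # keep quefrency zero once
    win[1:cutoff] = 2.0                 # right-sided, doubled positive quefrencies
    log_spec = np.fft.fft(cep * win)    # complex log spectrum
    return log_spec.real, log_spec.imag
```

For a truly minimum-phase system (e.g. a single pole inside the unit circle) and a sufficiently long cutoff, the recovered phase matches the true phase of the system.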
The remaining analysis steps in the time-scale modifying system of FIGURE 8 are analogous to those described above in connection with the other embodiments. As a result of the matching algorithm, all of the amplitudes and phases of the excitation and system components measured for an arbitrary frame k are associated with a corresponding set of parameters for frame k+1. The next step in the synthesis is to interpolate the matched excitation and system parameters across frame boundaries. The interpolation procedures are based on the assumption that the excitation and system functions are slowly varying across frame boundaries. This is consistent with the assumption that the model parameters are slowly varying relative to the duration of the vocal tract impulse response. Since this slowly-varying constraint maps to a slowly-varying excitation and system amplitude, it suffices to interpolate these functions linearly.
Since the vocal tract system is assumed slowly varying over consecutive frames, it is reasonable to assume that its phase is slowly varying as well, and thus linear interpolation of the phase samples will also suffice. However, the characteristic of "slowly varying" is more difficult to achieve for the system phase than for the system magnitude. This is because an additional constraint must be imposed on the measured phase; namely, that the phase be smooth and unwrapped as a function of frequency at each frame boundary. It can be shown that if the system phase is obtained modulo 2π, then linear interpolation can result in a (falsely) rapidly-varying system phase between frame boundaries. The importance of the homomorphic analyser of FIGURE 9 is now evident. The system phase estimate derived from the homomorphic analysis is unwrapped in frequency and thus slowly varying when the system amplitude (from which it was derived) is slowly varying. Linear interpolation of samples of this function then results in a phase trajectory which reflects the underlying vocal tract movement. This phase function is referred to as Φ_ℓ(t), where Φ_ℓ(0) corresponds to the Φ_ℓ^k of Equation (20). Finally, as before, a cubic polynomial is employed to interpolate the excitation phase and frequency. This will be referred to as Ω_ℓ(t), where Ω_ℓ(0) corresponds to the Ω_ℓ^k of Equation (22).
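The cubic excitation-phase interpolation can be sketched as follows, using the maximal-smoothness choice of the 2π multiple M familiar from sinusoidal modeling. The coefficient formulas here are a reconstruction consistent with the boundary conditions (phase and frequency matched at both frame edges), not quoted verbatim from the patent.

```python
import numpy as np

def cubic_phase_track(theta0, omega0, thetaN, omegaN, N):
    """Cubic phase track over one frame of N samples: matches phase and
    frequency at both frame boundaries, choosing the 2*pi multiple M
    that makes the track maximally smooth."""
    # Maximal-smoothness unwrapping multiple
    x = ((theta0 + omega0 * N - thetaN) + (omegaN - omega0) * N / 2.0) / (2.0 * np.pi)
    M = np.round(x)
    dtheta = thetaN + 2.0 * np.pi * M - theta0 - omega0 * N
    alpha = 3.0 * dtheta / N**2 - (omegaN - omega0) / N
    beta = -2.0 * dtheta / N**3 + (omegaN - omega0) / N**2
    t = np.arange(N + 1)
    return theta0 + omega0 * t + alpha * t**2 + beta * t**3
```

By construction the track starts at (theta0, omega0), ends at thetaN modulo 2π with frequency omegaN, and the intervening frequency trajectory is quadratic.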
The goal of time-scale modification is to maintain the perceptual quality of the original speech while changing the apparent rate of articulation. This implies that the frequency trajectories of the excitation (and thus the pitch contour) are stretched or compressed in time and the vocal tract changes at a slower or faster rate. The synthesis method of the previous section is ideally suited for this transformation since it involves summing sine waves composed of vocal cord excitation and vocal tract system contributions for which explicit functional expressions have been derived.
Speech events which take place at a time t0 according to the new time scale will have occurred at t0/ρ in the original time scale. To apply the above sine wave model to time-scale modification, the "events" which are time-scaled are the system amplitudes and phases, and the excitation amplitudes and frequencies, along each frequency track. Since the parameter estimates of the unmodified synthesis are available as continuous functions of time, then in theory any rate change is possible. In conjunction with Equations (19)–(22), the time-scaled synthetic waveform can be expressed as:

s'(n) = Σ (ℓ=1 to L'(n)) A_ℓ(n/ρ) cos[ρ Ω_ℓ(n/ρ) + Φ_ℓ(n/ρ)]    (23)

where L'(n) is the number of sine waves estimated at time n. The required values in equation (23) are obtained by simply evaluating A_ℓ(t), Ω_ℓ(t) and Φ_ℓ(t) at the time n/ρ and scaling the resulting excitation phase by ρ.
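For a single track with the amplitude, excitation phase and system phase available as continuous functions (here, hypothetical callables), Equation (23) reduces to a few lines. Scaling the excitation phase by ρ while evaluating it at n/ρ keeps the instantaneous excitation frequency, and hence the pitch, unchanged while the event timing stretches by ρ.

```python
import numpy as np

def time_scale_track(A, Omega, Phi, rho, n_out):
    """Time-scale one track per Equation (23): evaluate amplitude,
    excitation phase and system phase at n/rho, and scale the excitation
    phase by rho so the instantaneous excitation frequency (the pitch)
    is preserved."""
    n = np.arange(n_out)
    t = n / rho                 # position in the original time scale
    return A(t) * np.cos(rho * Omega(t) + Phi(t))
```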

With the proposed time-scale modification system, it is also straightforward to apply a time-varying rate change. Here the time-warping transformation is given by

t0 = W(t0') = ∫ (from 0 to t0') ρ(τ) dτ    (24)

where ρ(τ) is the desired time-varying rate change. In this generalization, each time differential dτ is scaled by a different factor ρ(τ). Speech events which take place at a time t0 in the new time scale will now occur at a time t0' = W⁻¹(t0) in the original time scale. If t0 maps back to t0', then one approximation is given by:

t1' ≈ t0' + 1/ρ(t0')    (25)

Since the parameters of the sinusoidal components are available as continuous functions of time, they can always be found at the required t1'.
Letting tn' denote the inverse of W at the time tn = n, the synthetic waveform is then given by:

s'(n) = Σ (ℓ=1 to L(n)) A_ℓ(tn') cos[Ω̃_ℓ(n) + Φ_ℓ(tn')]    (26)

where

Ω̃_ℓ(n) = Ω̃_ℓ(n−1) + ω̃_ℓ(tn')    (27)

and

tn' = t(n−1)' + 1/ρ(t(n−1)')    (28)

where ω̃_ℓ(t) is a quadratic function given by the first derivative of the cubic phase function Ω_ℓ(t), and where

t0' = 0    (29)

At the time a particular track is born, the phase function Ω̃_ℓ(n) is initialized to the value ρ(tn') Ω_ℓ(tn'), where Ω_ℓ(tn') is the initial excitation phase obtained using (17).
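The incremental form of Equations (26)–(29) for one track can be sketched as below; `omega` plays the role of the quadratic ω̃_ℓ(t), and all callables are hypothetical stand-ins.

```python
import numpy as np

def time_varying_warp_synthesis(A, omega, Phi, rho, n_out):
    """Synthesize one track under a time-varying rate change: the
    original-time position t' advances by 1/rho(t') per output sample
    (Eq. 28), and the excitation phase accumulates the instantaneous
    frequency omega(t') evaluated there (Eq. 27), starting from
    t0' = 0 (Eq. 29)."""
    t = 0.0
    exc_phase = 0.0
    out = np.empty(n_out)
    for n in range(n_out):
        out[n] = A(t) * np.cos(exc_phase + Phi(t))  # Equation (26)
        t = t + 1.0 / rho(t)                        # Equation (28)
        exc_phase += omega(t)                       # Equation (27)
    return out
```

With a constant rate ρ and a constant excitation frequency, the output has the same pitch as the original but plays out over ρ times as many samples.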
It should also be appreciated that the invention can be used to perform frequency and pitch scaling. The short-time spectral envelope of the synthetic waveform can be varied by scaling each frequency component, and the pitch of the synthetic waveform can be altered by scaling the excitation-contributed frequency components.
In FIGURE 10 a final embodiment of the invention is shown which has been implemented and operated in real time. The illustrated embodiment was implemented in 16-bit fixed point arithmetic using four Lincoln Digital Signal Processors (LDSPs). The foreground program operates on every input A/D sample, collecting 100 input speech samples into 10 msec buffers. At the same time a 10 msec buffer of synthesized speech is played out through a D/A converter. At the end of each frame, the most recent speech is pushed down into a 600 msec buffer.
It is from this buffer that the data for the pitch-adaptive Hamming window is drawn and on which a 512 point Fast Fourier Transform (FFT) is applied.
Next a set of amplitudes and frequencies is obtained by locating the peaks of the magnitude of the FFT.
The data is supplied to the pitch extraction module, which generates the pitch estimate that controls the pitch-adaptive windows. This parameter is also supplied to the coding module in the data compression application. Once the pitch has been estimated, another pitch-adaptive Hamming window is buffered and transferred to another LDSP for parallel computation. Another 512 point FFT is taken for the purpose of estimating the amplitudes, frequencies and phases to which the coding and speech modification methods will be applied. Once these peaks have been determined, the frequency tracking and phase interpolation methods are implemented. Depending upon the application, these parameters are coded or modified to effect a speech transformation and transferred to another pair of LDSPs, where the sum-of-sine-waves synthesis is implemented. The resulting synthetic waveform is then transferred back to the master LDSP, where it is put into the appropriate buffer to be accessed by the foreground program for D/A output.
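The front-end peak picking described above (window the frame, take a 512-point FFT, locate local maxima of the magnitude) can be sketched as follows; no pitch-adaptive window sizing or peak refinement is attempted, and the function name is an assumption.

```python
import numpy as np

def find_spectral_peaks(x, window, n_fft=512):
    """Window one frame, take an n_fft-point FFT, and return a list of
    (frequency in rad/sample, magnitude, phase) tuples at the local
    maxima of the magnitude spectrum."""
    X = np.fft.rfft(x * window, n_fft)
    mag = np.abs(X)
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]:  # local maximum
            freq = 2 * np.pi * k / n_fft
            peaks.append((freq, mag[k], np.angle(X[k])))
    return peaks
```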

Claims (65)

1. A method of processing an acoustic waveform, the method comprising:
a. sampling the waveform to obtain a series of discrete samples and constructing therefrom a series of frames, each frame spanning a plurality of samples;
b. analyzing each frame of samples to extract a set of frequency components having individual amplitudes;
c. tracking said components from one frame to a next frame; and d. interpolating the values of the components from the one frame to the next frame to obtain a parametric representation of the waveform whereby a synthetic waveform can be constructed by generating a set of sine waves corresponding to the interpolated values of the parametric representation.
2. The method of claim 1 wherein the step of sampling further includes constructing a frame having variable length, which varies in accordance with the pitch period, the length being at least twice the pitch period of the waveform.
3. The method of claim 1 wherein the step of sampling further includes sampling the waveform according to a Hamming window.
4. The method of claim 1 wherein the step of analyzing further includes analyzing each frame by Fourier analysis.
5. The method of claim 1 wherein the step of analyzing further includes selecting a harmonic series to approximate the frequency components.
6. The method of claim 5 wherein the number of frequency components in the harmonic series varies according to the pitch period of the waveform.
7. The method of claim 1 wherein the step of tracking further includes matching a frequency component from the one frame with a component in the next frame having a similar value.
8. The method of claim 7 wherein said matching further provides for the birth of new frequency components and the death of old frequency components.
9. The method of claim 1 wherein the step of interpolating values further includes defining a series of instantaneous frequency values by interpolating matched frequency components from the one frame to the next frame and then integrating the series of instantaneous frequency values to obtain a series of interpolated phase values.
10. The method of claim 1 wherein the step of interpolating further includes deriving phase values from frequency and phase measurements taken at each frame and then interpolating the phase measurements.
11. The method of claim 1 wherein the step of interpolating is achieved by performing an overlap and add function.
12. The method of claim 1 wherein the method further includes coding the frequency components for digital transmission.
13. The method of claim 12 wherein the frequency components are limited to a predetermined number defined by a plurality of harmonic frequency bins.
14. The method of claim 13 wherein the amplitude of only one of said components is coded for gain and the amplitudes of the others are coded relative to the neighboring component at the next lowest frequency.
15. The method of claim 12 wherein the phases are coded by applying pulse code modulation techniques to a predicted phase residual.
16. The method of claim 12 wherein high frequency regeneration is applied.
17. The method of claim 1 wherein the method further comprises constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency and amplitude to the extracted components.
18. The method of claim 17 wherein the time-scale of said reconstructed waveform is varied by changing the rate at which said series of constituent sine waves are interpolated.
19. The method of claim 18 wherein the time-scale is continuously variable over a defined range.
20. The method of claim 1 wherein the method further comprises constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency, amplitude, and phase to the extracted components.
21. The method of claim 20 wherein the time-scale of said reconstructed waveform is varied by changing the rate at which said series of constituent sine waves are interpolated.
22. The method of claim 21 wherein the time-scale is continuously variable over a defined range.
23. The method of claim 20 wherein the constituent sine waves are further defined by system contributions and excitation contributions and wherein the time-scale of said reconstructed waveform is varied by changing the rate at which parameters defining the system contributions of the sine waves are interpolated.
24. The method of claim 17 wherein the short-time spectral envelope of the synthetic waveform is varied by scaling each frequency component.
25. The method of claim 23 wherein the pitch of the synthetic waveform is altered by scaling the excitation-contributed frequency components.
26. A device for processing an acoustic waveform, the device comprising:
a. sampling means for sampling the waveform to obtain a series of discrete samples and constructing therefrom a series of frames, each frame spanning a plurality of samples;

b. analyzing means for analyzing each frame of samples to extract a set of frequency components having individual amplitudes;
c. tracking means for tracking said components from one frame to a next frame; and d. interpolating means for interpolating the values of the components from the one frame to the next frame to obtain a parametric representation of the waveform whereby a synthetic waveform can be constructed by generating a set of sine waves corresponding to the interpolated values of the parametric representation.
27. The device of claim 26 wherein the sampling means further includes means for constructing a frame having variable length, which varies in accordance with the pitch period, the length being at least twice the pitch period of the waveform.
28. The device of claim 26 wherein the sampling means further includes means for sampling according to a Hamming window.
29. The device of claim 26 wherein the analyzing means further includes means for analyzing each frame by Fourier analysis.
30. The device of claim 26 wherein the analyzing means further includes means for selecting a harmonic series to approximate the frequency components.
31. The device of claim 30 wherein the number of frequency components in the harmonic series varies according to the pitch period of the waveform.
32. The device of claim 26 wherein the tracking means further includes means for matching a frequency component from the one frame with a component in the next frame having a similar value.
33. The device of claim 32 wherein said matching means further provides for the birth of new frequency components and the death of old frequency components.
34. The device of claim 26 wherein the interpolating means further includes means defining a series of instantaneous frequency values by interpolating matched frequency components from the one frame to the next frame and means for integrating the series of instantaneous frequency values to obtain a series of interpolated phase values.
35. The device of claim 26 wherein the interpolating means further includes means for deriving phase values from the frequency and phase measurements taken at each frame and then interpolating the phase measurements.
36. The device of claim 26 wherein the interpolating means further includes means for performing an overlap and add function.
37. The device of claim 26 wherein the device further includes coding means for coding the frequency components for digital transmission.
38. The device of claim 32 wherein the frequency components are limited to a predetermined number defined by a plurality of harmonic frequency bins.
39. The device of claim 38 wherein the amplitude of only one of said components is coded for gain and the amplitudes of the others are coded relative to the neighboring component of the next lowest frequency.
40. The device of claim 37 wherein the coding means further comprises means for applying pulse code modulation techniques to a predicted phase residual.
41. The device of claim 37 wherein the coding means further comprises means for generating high frequency components.
42. The device of claim 26 wherein the device further comprises means for constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency and amplitude to the extracted components.
43. The device of claim 42 wherein the time-scale of said reconstructed waveform is varied by changing the rate at which said series of constituent sine waves are interpolated.
44. The device of claim 43 wherein the time-scale is continuously variable over a defined range.
45. The device of claim 26 wherein the device further comprises means for constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency, amplitude, and phase to the extracted components.
46. The device of claim 45 wherein the time-scale of said reconstructed waveform is varied by changing the rate at which said series of constituent sine waves are interpolated.
47. The device of claim 46 wherein the time-scale is continuously variable over a defined range.
48. The device of claim 42 wherein the constituent sine waves are further defined by system contributions and excitation contributions and wherein the time-scale of said reconstructed waveform is varied by changing the rate at which parameters defining the system contributions of the sine waves are interpolated.
49. The device of claim 48 wherein the device further includes a scaling means for scaling the frequency components.
50. The device of claim 48 wherein the device further includes a scaling means for scaling the excitation-contributed frequency components.
51. A speech coding device comprising:
a. sampling means for sampling the waveform to obtain a series of discrete samples and constructing therefrom a series of frames, each frame spanning a plurality of samples;
b. analyzing means for analyzing each frame of samples by Fourier analysis to extract a set of frequency components having individual amplitude values;
c. tracking means for tracking the components from one frame to a next frame; and d. coding means for coding the component values.
52. The device of claim 51 wherein the coding means further includes means for selecting a harmonic series of bins to approximate the frequency components and the number of bins varies according to the pitch of the waveform.
53. The device of claim 51 wherein the amplitude of only one of said components is coded for gain and the amplitudes of the other components are coded relative to the neighboring component at the next lowest frequency.
54. The device of claim 51 wherein the amplitudes of the components are coded by linear prediction techniques.
55. The device of claim 51 wherein the amplitudes of the components are coded by adaptive delta modulation techniques.
56. The device of claim 51 wherein the analyzing means further comprises means for measuring phase values for each frequency component.
57. The device of claim 56 wherein the coding means further includes means for coding the phase values by applying pulse code modulations to a predicted phase residual.
58. The device of claim 56 wherein the coding means further includes means for generating high frequency component values from coded low frequency component values.
59. A device for altering the time-scale of an audible waveform, the device comprising:
a. sampling means for sampling the waveform to obtain a series of discrete samples and constructing therefrom a series of frames, each frame spanning a plurality of samples;
b. analyzing means for analyzing each frame of samples to extract a set of frequency components having individual amplitudes;
c. tracking means for tracking said components from one frame to a next frame;
d. interpolating means for interpolating the amplitude and frequency values of the components from the one frame to the next frame to obtain a representation of the waveform whereby a synthetic waveform can be constructed by generating a set of sine waves corresponding to the interpolated representation;
e. scaling means for altering the rate of interpolation; and f. synthesizing means for constructing a time-scaled synthetic waveform by generating a series of constituent sine waves corresponding in frequency and amplitude to the extracted components, the sine waves being generated at said alterable interpolation rate.
60. The device of claim 59 wherein the time scale is continuously variable over a defined range.
61. The device of claim 59 wherein the analyzing mean further comprises means for measuring phase values for each frequency component.
62. The device of claim 61 wherein the component phase values are interpolated by cubic interpolation.
63. The device of claim 61 wherein the time-scale is continuously variable over a defined range.
64. The device of claim 61 wherein the device further comprises means for separating the measured frequency components into system contributions and excitation contributions and wherein the time-scale of the synthetic waveform is varied by altering the rate at which values defining the system contributions are interpolated.
65. The device of claim 64 wherein the scaling means alters the rate at which the system amplitudes and phases, and the excitation amplitudes and frequencies are interpolated.
CA000504354A 1985-03-18 1986-03-18 Processing of acoustic waveforms Expired CA1243122A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71286685A 1985-03-18 1985-03-18
US712,866 1985-03-18

Publications (1)

Publication Number Publication Date
CA1243122A true CA1243122A (en) 1988-10-11

Family

ID=24863876

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000504354A Expired CA1243122A (en) 1985-03-18 1986-03-18 Processing of acoustic waveforms

Country Status (5)

Country Link
EP (1) EP0215915A4 (en)
JP (1) JP2759646B2 (en)
AU (1) AU597573B2 (en)
CA (1) CA1243122A (en)
WO (1) WO1986005617A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4856068A (en) * 1985-03-18 1989-08-08 Massachusetts Institute Of Technology Audio pre-processing methods and apparatus
US4937873A (en) * 1985-03-18 1990-06-26 Massachusetts Institute Of Technology Computationally efficient sine wave synthesis for acoustic waveform processing
US4797926A (en) * 1986-09-11 1989-01-10 American Telephone And Telegraph Company, At&T Bell Laboratories Digital speech vocoder
US4771465A (en) * 1986-09-11 1988-09-13 American Telephone And Telegraph Company, At&T Bell Laboratories Digital speech sinusoidal vocoder with transmission of only subset of harmonics
CA1332982C (en) * 1987-04-02 1994-11-08 Robert J. Mcauley Coding of acoustic waveforms
US5029509A (en) * 1989-05-10 1991-07-09 Board Of Trustees Of The Leland Stanford Junior University Musical synthesizer combining deterministic and stochastic waveforms
WO1993018505A1 (en) * 1992-03-02 1993-09-16 The Walt Disney Company Voice transformation system
JP3212785B2 (en) * 1993-12-22 2001-09-25 防衛庁技術研究本部長 Signal detection device
DE4425767C2 (en) * 1994-07-21 1997-05-28 Rainer Dipl Ing Hettrich Process for the reproduction of signals with changed speed
US5812737A (en) * 1995-01-09 1998-09-22 The Board Of Trustees Of The Leland Stanford Junior University Harmonic and frequency-locked loop pitch tracker and sound separation system
JP3262204B2 (en) * 1996-03-25 2002-03-04 日本電信電話株式会社 Frequency component extraction device
US6266003B1 (en) * 1998-08-28 2001-07-24 Sigma Audio Research Limited Method and apparatus for signal processing for time-scale and/or pitch modification of audio signals
CN1266674C (en) * 2000-02-29 2006-07-26 高通股份有限公司 Closed-loop multimode mixed-domain linear prediction (MDLP) speech coder
CN1262991C (en) * 2000-02-29 2006-07-05 高通股份有限公司 Method and apparatus for tracking the phase of a quasi-periodic signal
JP3404350B2 (en) * 2000-03-06 2003-05-06 パナソニック モバイルコミュニケーションズ株式会社 Speech coding parameter acquisition method, speech decoding method and apparatus
SE0004221L (en) * 2000-11-17 2002-04-02 Forskarpatent I Syd Ab Method and apparatus for speech analysis
DE10197182B4 (en) * 2001-01-22 2005-11-03 Kanars Data Corp. Method for coding and decoding digital audio data
AU2003291862A1 (en) * 2003-12-01 2005-06-24 Aic A highly optimized method for modelling a windowed signal
US8027242B2 (en) 2005-10-21 2011-09-27 Qualcomm Incorporated Signal coding and decoding based on spectral dynamics
US8392176B2 (en) 2006-04-10 2013-03-05 Qualcomm Incorporated Processing of excitation in audio coding and decoding
KR101080421B1 (en) * 2007-03-16 2011-11-04 삼성전자주식회사 Method and apparatus for sinusoidal audio coding
US8428957B2 (en) 2007-08-24 2013-04-23 Qualcomm Incorporated Spectral noise shaping in audio coding based on spectral dynamics in frequency sub-bands
AU2012366843B2 (en) 2012-01-20 2015-08-06 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for audio encoding and decoding employing sinusoidal substitution
WO2017158105A1 (en) * 2016-03-18 2017-09-21 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Encoding by reconstructing phase information using a structure tensor on audio spectrograms

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3360610A (en) * 1964-05-07 1967-12-26 Bell Telephone Labor Inc Bandwidth compression utilizing magnitude and phase coded signals representative of the input signal
US4058676A (en) * 1975-07-07 1977-11-15 International Communication Sciences Speech analysis and synthesis system
US4076958A (en) * 1976-09-13 1978-02-28 E-Systems, Inc. Signal synthesizer spectrum contour scaler
JPS6017120B2 (en) * 1981-05-29 1985-05-01 Matsushita Electric Industrial Co., Ltd. Phoneme piece-based speech synthesis method
JPS6040631B2 (en) * 1981-12-08 1985-09-11 Matsushita Electric Industrial Co., Ltd. Phoneme editing type speech synthesis method
JPS592033A (en) * 1982-06-28 1984-01-07 Hitachi Ltd Rear projection screen
JPS597399A (en) * 1982-07-02 1984-01-14 Matsushita Electric Industrial Co., Ltd. Monosyllable voice recognition equipment
JPS5942598A (en) * 1982-09-03 1984-03-09 Nippon Telegraph and Telephone Corp. Rule synthesization/connection circuit
JPS6088326A (en) * 1983-10-19 1985-05-18 Kawai Musical Instr Mfg Co Ltd Sound analyzer
JPH079591B2 (en) * 1983-11-01 1995-02-01 Kawai Musical Instruments Mfg. Co., Ltd. Instrument sound analyzer
JPS6097398A (en) * 1983-11-01 1985-05-31 Kawai Musical Instruments Mfg. Co., Ltd. Sound analyzer

Also Published As

Publication number Publication date
AU5620886A (en) 1986-10-13
JP2759646B2 (en) 1998-05-28
WO1986005617A1 (en) 1986-09-25
AU597573B2 (en) 1990-06-07
EP0215915A1 (en) 1987-04-01
JPS62502572A (en) 1987-10-01
EP0215915A4 (en) 1987-11-25

Similar Documents

Publication Publication Date Title
USRE36478E (en) Processing of acoustic waveforms
CA1243122A (en) Processing of acoustic waveforms
US4937873A (en) Computationally efficient sine wave synthesis for acoustic waveform processing
Evangelista Pitch-synchronous wavelet representations of speech and music signals
US5054072A (en) Coding of acoustic waveforms
US6377916B1 (en) Multiband harmonic transform coder
CN1838238B (en) Apparatus for enhancing audio source decoder
McAulay et al. Magnitude-only reconstruction using a sinusoidal speech model
EP0285275A2 (en) Audio pre-processing methods and apparatus
JPH0744193A (en) High-efficiency encoding method
CA1332982C (en) Coding of acoustic waveforms
McAulay et al. Multirate sinusoidal transform coding at rates from 2.4 kbps to 8 kbps
EP0766230B1 (en) Method and apparatus for coding speech
McAulay et al. Mid-rate coding based on a sinusoidal representation of speech
Cavaliere et al. Granular synthesis of musical signals
US6003000A (en) Method and system for speech processing with greatly reduced harmonic and intermodulation distortion
Robinson Speech analysis
Parikh et al. Frame erasure concealment using sinusoidal analysis-synthesis and its application to MDCT-based codecs
Ahmadi et al. New techniques for sinusoidal coding of speech at 2400 bps
Levine et al. A compact and malleable sines+ transients+ noise model for sound
Akamine et al. ARMA model based speech coding at 8 kb/s
Dubnov YASAS-Yet another sound analysis-synthesis method
Yuan The weighted sum of the line spectrum pair for noisy speech
Macon et al. Applications of sinusoidal modeling to speech and audio signal processing
Abu-Shikhah et al. A hybrid LP-harmonics model for low bit-rate speech compression with natural quality

Legal Events

Date Code Title Description
MKEX Expiry