WO1996018187A1

WO1996018187A1 - Method and apparatus for parameterization of speech excitation waveforms

Info

Publication number: WO1996018187A1
Application number: PCT/US1995/012174
Authority: WO
Inventors: Chad Scott Bergstrom; Bruce Alan Fette; Cynthia Ann Jaskie; Clifford Wood; Sean Sungsoo You
Original assignee: Motorola Inc.
Priority date: 1994-12-05
Filing date: 1995-09-25
Publication date: 1996-06-13
Also published as: AU3723995A; AR000105A1

Abstract

A speech vocoder device and corresponding method parameterizes speech excitation waveforms. An analysis portion performs an excitation pulse compression process (46) which filters (64, 66, 68) an excitation template to produce compressed excitation from which excitation parameters are estimated (74). An optimal excitation target is selected using a closed-loop process (48) that selects the target based on a minimum error (158) between the original waveform and waveforms created by interpolating (156) between candidate targets. An adaptive excitation weighting function is created (178) based on the excitation target's features and a preselected characterization methodology, and the function is applied (180) to the excitation target. The excitation is characterized (52) and encoded (54) for digital transmission.

Description

METHOD AND APPARATUS FOR PARAMETERIZATION OF SPEECH EXCITATION WAVEFORMS

Field of the Invention

The present invention relates generally to the field of encoding signals having periodic components and, more particularly, to techniques and devices for digitally encoding speech waveforms.

Background of the Invention

Voice coders, referred to commonly as "vocoders", compress and decompress speech data. Vocoders allow a digital communication system to increase the number of system communication channels by decreasing the bandwidth allocated to each channel. Fundamentally, a vocoder implements specialized signal processing techniques to analyze or compress speech data at an analysis device and synthesize or decompress the speech data at a synthesis device. Speech data compression typically involves parametric analysis techniques, whereby the fundamental or "basis" elements of the speech signal are extracted. These extracted basis elements are encoded and sent to the synthesis device in order to provide for reduction in the amount of transmitted or stored data. At the synthesis device, the basis elements may be used to reconstruct an approximation of the original speech signal. Because the synthesized speech is typically an inexact approximation derived from the basis elements, a listener at the synthesis device may detect voice quality which is inferior to the original speech signal. This is particularly true for vocoders that compress the speech signal to low bit rates, where less information

about the original speech signal may be transmitted or stored.

A number of voice coding methodologies extract the speech basis elements by using a linear predictive coding (LPC) analysis of speech, resulting in prediction coefficients that describe an all-pole vocal tract transfer function. LPC analysis generates an "excitation" waveform that represents the driving function of the transfer function. Ideally, if the LPC coefficients and the excitation waveform could be transmitted to the synthesis device exactly, the excitation waveform could be used as a driving function for the vocal tract transfer function, exactly reproducing the input speech. In practice, however, the bit-rate limitations of a communication system will not allow for complete transmission of the excitation waveform.

Speech basis elements include parametric components of the excitation waveform. Accurate parameterization (i.e., the extraction and subsequent representation of the basis elements) is difficult to achieve at low bit rates. Extraction of excitation parameters, such as voicing modes, pitch, and excitation position can be complicated by the presence of excitation epoch dispersion (i.e., distribution of excitation impulse energy across an epoch period), non-periodic excitation components (e.g., white or colored noise), secondary excitation (e.g.. additional periodic components), and large excitation amplitude variations. Prior-art methods, such as LPClOe (FS 1015), do not adequately maintain the excitation characteristics that are necessary for the reproduction of high- quality speech.

A complication of inadequate, low bit-rate excitation parameterization is interpolation error. Such error can be manifest, for example, in the form of variations in the reconstructed excitation waveform. Significant deviation of the interpolated waveform energy relative to the original waveform can lead to audible artifacts in the synthesized speech. Prior-art methods select interpolation targets independently with respect to waveform amplitude characteristics and interpolation outcomes, often resulting in significant variations in overall waveform energy with respect to the original residual excitation waveform.

Accurate parameterization at low bit rates is also difficult to achieve due to the asymmetrical, time- varying nature of the pitch-synchronous excitation waveform segments. As such, prior-art static weighting functions are inappropriate for pitch- synchronous modeling. Such fixed-envelope weighting functions can distort the fundamental excitation epoch characteristics that are necessary for high-quality reproduction of speech. Global trends toward complex, high-capacity telecommunications emphasize a growing need for high-quality speech coding techniques that require less bandwidth. Near-future telecommunications networks will continue to demand very high-quality voice communications at the lowest possible bit rates. Military applications, such as cockpit communications and mobile radios, demand higher levels of voice quality. In order to produce high-quality speech, limited-bandwidth systems must be able to accurately reconstruct the salient waveform features after transmission or storage. Thus, what are needed are an excitation pulse compression method and apparatus that serves to refine waveform parameter estimates by overcoming complications introduced by excitation epoch dispersion, non-periodic excitation components, secondary excitation, and the wide variance in excitation amplitudes. What are further needed are an apparatus and a method using a more intelligent basis for interpolation target selection that reduces nonrepresentative energy deviation that can result from the interpolation process during excitation waveform reconstruction. What are further needed in order to accommodate the time- varying envelope characteristics of the excitation waveform are an apparatus and method of applying an adaptive, time-varying weighting function which tailors its output towards the waveform characteristics of the excitation epoch being considered. What are further needed are a method and apparatus for parameterization of the speech excitation waveform that achieves high-quality speech after reconstruction of an excitation waveform estimate.

Brief Description of the Drawings

FIG. 1 shows an illustrative vocoder apparatus in accordance with a preferred embodiment of the present invention;

FIG. 2 illustrates a flow chart of a method for speech excitation analysis in accordance with a preferred embodiment of the present invention; FIG. 3 illustrates a flow chart of a method for compressing excitation pulses in accordance with a preferred embodiment of the present invention;

FIG. 4 illustrates an excitation template and time-domain matched pulse compression filter coefficients derived in accordance with a preferred embodiment of the present invention; FIG. 5 illustrates a flow chart of a method for determining all-pass pulse compression filter coefficients in accordance with an alternate embodiment of the present invention;

FIG. 6 illustrates a flow chart of a method for determining whitening pulse compression filter coefficients in accordance with another alternate embodiment of the present invention;

FIG. 7 shows an example of an excitation waveform and a pulse compressed excitation waveform derived in accordance with a preferred embodiment of the present invention;

FIG. 8 illustrates a flow chart of a method for closed-loop excitation target selection in accordance with a preferred embodiment of the present invention;

FIG. 9 illustrates exemplary waveforms derived using the closed-loop target selection method in accordance with a preferred embodiment of the present invention;

FIG. 10 illustrates a flow chart of a method for adaptively weighting an excitation waveform in accordance with a preferred embodiment of the present invention;

FIG. 11 shows an example of an adaptive weighting function derived in accordance with a preferred embodiment of the present invention; and FIG. 12 shows an example of an adaptive weighting function derived in accordance with a preferred embodiment of the present invention as it relates to the excitation portion from which it was derived; and

FIG. 13 shows an example of an adaptive weighting function derived in accordance with a preferred embodiment of the present invention that would preserve more existing secondary excitation in a given excitation segment.

Detailed Description of the Drawings

The present invention provides an accurate excitation waveform parameterization technique and apparatus that results in higher quality speech at lower bit rates than is possible with prior-art methods. Generally, the present invention introduces a new excitation parameterization method and apparatus that serve to maintain high voice quality when used in an appropriate excitation-based vocoder architecture. This method is applicable for implementation in new and existing voice coding platforms that require efficient, accurate excitation modeling algorithms. In such platforms, accurate modeling of the LPC-derived excitation waveform is essential in order to reproduce high-quality speech at low bit rates.

One advantage of the present invention is that it performs excitation pulse compression that serves to refine waveform parameter estimates by overcoming complications introduced by excitation epoch dispersion, non-periodic excitation components, secondary excitation, and the wide variance in excitation amplitudes.

Another advantage of the present invention is that it intelligently selects interpolation targets in order to reduce nonrepresentative energy deviations relative to the original waveform that can result from the excitation interpolation process. An additional advantage of the present invention is that it has the capability to generate an adaptive, time-varying weighting function that tailors its output toward the waveform characteristics of the excitation epoch being considered. In a preferred embodiment of the present invention, the vocoder apparatus desirably includes an analysis function that performs parameterization of the LPC-derived speech excitation waveform, and a synthesis function that performs reconstruction and speech synthesis of the parameterized excitation waveform. In the analysis function, basis excitation waveform elements are extracted from the LPC-derived excitation waveform by using the parameterization method discussed below. This results in parameters that accurately describe the LPC-derived excitation waveform at a significantly reduced bit-rate. In the synthesis function, these parameters may be used to reconstruct an accurate estimate of the excitation waveform, which may subsequently be used to generate a high-quality estimate of the original speech.

A. Vocoder Apparatus

FIG. 1 shows an illustrative vocoder apparatus in accordance with a preferred embodiment of the present invention. The vocoder apparatus comprises a vocoder analysis device 10 and a vocoder synthesis device 24. Vocoder analysis device 10 comprises analog-to-digital converter 14, analysis memory 16, analysis processor 18, and analysis modem 20. Microphone 12 is coupled to analog-to-digital converter 14 which converts analog voice signals from microphone 12 into digitized speech samples. Analog-to-digital converter 14 may be, for example, a 32044 codec available from Texas Instruments of Dallas, Texas. In a preferred embodiment, analog-to-digital converter 14 is coupled to analysis memory device 16. Analysis memory device 16 is coupled to analysis processor 18. In an alternate embodiment, analog-to-digital converter 14 is coupled directly to analysis processor 18. Analysis processor 18 may be, for example, a digital signal processor such as a DSP56001, DSP56002, DSP96002 or DSP56166 integrated circuit available from Motorola, Inc. of Schaumburg, Illinois.

In a preferred embodiment, analog-to-digital converter 14 produces digitized speech samples that are stored in analysis memory device 16. Analysis processor 18 extracts the sampled, digitized speech data from the analysis memory device 16. In an alternate embodiment, sampled, digitized speech data is stored directly in the memory or registers of analysis processor 18, thus eliminating the need for analysis memory device 16.

Analysis processor 18 performs parameterization and encoding functions on the sampled, digitized speech samples. These functions include analysis pre-processing, excitation pulse compression, closed-loop excitation target selection, adaptive excitation weighting, excitation characterization, and analysis post-processing. Several different methods of analysis pre-processing, excitation characterization, and analysis post- processing are well known to those of skill in the art. Analysis processor 18 also desirably includes functions of encoding the characterizing data using scalar quantization, vector quantization (VQ), split vector quantization, or multi-stage vector quantization codebooks. Analysis processor 18 produces an encoded bitstream of compressed speech data.

Analysis processor 18 is coupled to analysis modem 20 which accepts the encoded bitstream and prepares the bitstream for transmission using modulation techniques commonly known to those of skill in the art. Analysis modem 20 may be, for example, a V.32 modem available from Universal Data Systems of Huntsville, Alabama. Analysis modem 20 is coupled to communication channel 22, which may be any communication medium, such as fiber-optic cable, coaxial cable or a radio-frequency (RF) link. Other media may also be used as would be obvious to those of skill in the art based on the description herein.

Vocoder synthesis device 24 comprises synthesis modem 26, synthesis processor 28, synthesis memory 30, and digital-to-analog converter 32. Synthesis modem 26 is coupled to communication channel 22. Synthesis modem 26 accepts and demodulates the received, modulated bitstream. Synthesis modem 26 may be, for example, a V.32 modem available from Universal Data Systems of Huntsville, Alabama.

Synthesis modem 26 is coupled to synthesis processor 28. Synthesis processor 28 performs the decoding and synthesis of speech. Synthesis processor 28 may be, for example, a digital signal processor such as a DSP56001, DSP56002, DSP96002 or DSP 6166 integrated circuits available from Motorola, Inc. of Schaumburg, Illinois.

Synthesis processor 28 produces synthesized speech by performing the functions of synthesis pre-processing, desirably including decoding steps of scalar, vector, split vector, or multi-stage vector quantization codebooks. Synthesis processor 28 also performs the functions of excitation waveform reconstruction, speech synthesis, and synthesis post-processing. Many different methods exist to perform these functions which are well known to those of skill in the art.

In a preferred embodiment, synthesis processor 28 is coupled to synthesis memory device 30. In an alternate embodiment, synthesis processor 28 is coupled directly to digital-to-analog converter 32. Synthesis processor 28 stores the digitized, synthesized speech in synthesis memory device 30. Synthesis memory device 30 is coupled to digital-to-analog converter 32 which may be, for example, a 32044 codec available from Texas Instruments of Dallas, Texas. Digital-to-analog converter 32 converts the digitized, synthesized speech into an analog waveform appropriate for output to a speaker or other suitable output device 34.

For clarity and ease of understanding, FIG. 1 illustrates analysis device 10 and synthesis device 24 in separate physical devices. This configuration would provide simplex communication (i.e., communication in one direction only). Those of skill in the art would understand based on the description that an analysis device 10 and synthesis device 24 may be located in the same unit to provide half-duplex or full-duplex operation (i.e., communication in both the transmit and receive directions).

In an alternate embodiment, one or more processors may perform the functions of both analysis processor 18 and synthesis processor 28 without transmitting the encoded bitstream. The analysis processor would calculate the encoded bitstream and store the bitstream in a memory device. The synthesis processor could then retrieve the encoded bitstream from the memory device and perform synthesis functions, thus creating synthesized speech. The analysis processor and the synthesis processor may be a single processor as would be obvious to one of skill in the art based on the description. In the alternate embodiment, modems (e.g., analysis modem 20 and synthesis modem 26) would not be required to implement the present invention.

B. Speech Excitation Analysis

FIG. 2 illustrates a flow chart of a method for speech excitation analysis in accordance with a preferred embodiment of the present invention. The Excitation Analysis process is desirably carried out by analysis processor 18 (FIG. 1). The

Excitation Analysis process begins in step 40 (FIG. 2) by performing the Select Input Analysis Block step 42 which selects a finite number of digitized, input speech samples for processing. In a preferred embodiment, input speech samples are selected from an Input Speech Samples buffer 41 desirably located in analysis memory device 16 (FIG. 1). This finite number of input speech samples and companion excitation samples will be referred to herein as an "analysis block".

Next, the Analysis Pre-Processing step 44 performs high pass filtering, spectral pre-emphasis, and linear prediction coding (LPC) on the analysis block. These processes are well known to those skilled in the art. The result of the Analysis Pre- Processing step 44 is an LPC-derived excitation waveform, preliminary pitch estimate, preliminary voicing onset and offset locations (e.g., transitions from voiced to unvoiced speech and visa versa) within the analysis block, and preliminary excitation epoch positions within the analysis block. The preliminary voicing onset and offset locations and the preliminary excitation epoch positions are approximations of sample numbers within the analysis block of onset and offset locations and excitation epoch locations. As defined herein, an "epoch" or "excitation epoch" is a portion of the analysis block containing a single pitch period of excitation. After the Analysis Pre-processing step 44, the Compress Excitation Pulses step 46 accepts as input the LPC-derived excitation waveform for pulse compression. The Compress Excitation Pulses step 46 compresses the dispersed pulses of the LPC-derived excitation waveform, extracts excitation parameters from the pulse compressed waveform, and, optionally, performs periodic/non-periodic component decomposition. Extracted excitation parameters include a refined pitch estimate, refined excitation epoch positions, and refined voicing onset and offset locations, although other parameters may also be extracted. The Compress Excitation Pulses step 46 is described in more detail in conjunction with FIGS. 3-7. Next, the Closed-Loop Excitation Target Selection process 48 accepts extracted excitation parameters and the LPC-derived excitation waveform to perform closed-loop analysis and selection of an optimal excitation target within the analysis block. The optimal excitation target is chosen to minimize a selected error measure. The optimal excitation target is desirably synchronous to pitch. The Closed-Loop Excitation Target Selection process 48 is described in more detail in conjunction with FIG. 8.

After the Closed-Loop Excitation Target Selection process 48, the Adaptive Excitation Weighting process 50 is performed. The Adaptive Excitation Weighting process 50 derives an appropriate time- varying adaptive weighting function and applies it to the optimal excitation target, resulting in a weighted excitation portion. The adaptive weighting function attenuates the envelope of the optimal excitation target so as to preserve only those excitation components that are accurately represented by a characterization methodology pre-selected by the Adaptive Excitation Weighting process 50. The Adaptive Excitation Weighting process 50 is described in more detail in conjunction with FIGS. 9-11. Next, the Characterize Excitation step 52 produces characterization parameters from the weighted excitation portion. These characterization parameters include time domain and/or frequency domain values that serve to maintain the essence of the excitation target during transmission and subsequent reconstruction. Prior-art frequency- domain and time-domain characterization methods are well known to those of skill in the art.

The Analysis Post-Processing step 54 is then performed which desirably includes encoding steps of scalar quantization, vector quantization (VQ), split-vector quantization, or multi-stage vector quantization of the excitation parameters. These methods are well known to those of skill in the art. The result of the Analysis Post-Processing step 54 is a bitstream that contains the encoded speech data.

The Transmit or Store Bitstream step 56 accepts the bitstream from Analysis Post-Processing step 54 and either stores the bitstream to a memory device or transmits it to a modem (e.g., analysis modem 20, FIG. 1) for modulation and transmission.

The Excitation Analysis procedure then performs the Select Input Analysis Block step 42, and the procedure iterates as shown in FIG. 2.

After the basis elements of the speech have been extracted, encoded, and transmitted to the synthesis device 24 (FIG. 1), they are decoded , reconstructed, and used to synthesize an estimate of the original speech data.

1. Compress Excitation Pulses

Determination of voiced excitation parameters can be complicated by excitation epoch dispersion, non-periodic excitation components, secondary excitation, and wide variance in excitation amplitude. What is needed is an excitation pulse compression method that serves to refine waveform parameter estimates by overcoming these complications. The Compress Excitation Pulses process 60 (FIG. 3) overcomes these difficulties in parameter determination by transforming the excitation waveform into a train of narrow, symmetrical or near-symmetrical impulses. The impulse train may be used to refine parameter extraction of characteristics such as pitch period, onset or offset voicing, and epoch positions. Compressed excitation resulting from this method can also be applied toward periodic and nonperiodic decomposition techniques. Voice coding applications which utilize parametric modeling of excitation can utilize the Compress Excitation Pulses process 60 (FIG. 3) to refine waveform parameter estimates and decompose periodic and nonperiodic parts for application toward disjoint characterization methods. FIG. 3 illustrates a flow chart of a method for compressing excitation pulses in accordance with a preferred embodiment of the present invention. The Compress Excitation Pulses process begins in step 60 by performing the Select Excitation Template step 62. The Select Excitation Template step 62 selects a representative portion of the LPC-derived speech excitation waveform as a spectral magnitude and group delay template. In a preferred embodiment, the selected excitation template is pitch- synchronous, based upon coarse initial estimates of pitch and template position. However, it would be obvious to one of skill in the art based on the description that the selected excitation template may contain fewer samples than an estimated pitch period. Next, the Pre-weight Excitation Template step 63 may pre-weight the time- domain template using adaptive weighting functions embodied in this invention, or conventional weighting functions well known to those of skill in the art. As would be obvious to one of skill in the art based on the description herein, Pre-weight Excitation Template step 63 is optional.

Next, compression filter coefficients are determined by either the Determine Matched Compression Filter Coefficients step 64, the Determine All-Pass Filter Coefficients process 66 or the Determine Whitening Filter Coefficients process 68. The Determine All-Pass Filter Coefficients process 66 is described in more detail in conjunction with FIG. 5. The Determine Whitening Filter Coefficients process 68 is described in more detail in conjunction with FIG. 6.

In a preferred embodiment, the Determine Matched Compression Filter Coefficients step 64 is performed. The Determine Matched Compression Filter Coefficients step 64 overcomes deficiencies in the prior art without using computationally complex algorithms (e.g., Fast Fourier Transforms). Additionally, the Determine Matched Compression Filter Coefficients step 64 reduces time-domain pulse compressed waveform distortion possible with other methods.

The Determine Matched Compression Filter Coefficients step 64 determines matched filter coefficients that serve to cancel the group delay characteristics of the excitation template and excitation epochs in proximity to the excitation template. For example, an optimal ("opt") matched filter, familiar to those of skill in the art, may be defined by:

(Eqn. 1) H_opt(ω) - KX*(ω)e-J^ωT,

where H_opt(ω) is the frequency-domain transfer function of the matched filter, X*(ω) is the conjugate of an input signal spectrum (e.g., a spectrum of the excitation template) and K is a constant. Given the conjugation property of Fourier transforms:

(Eqn. 2) x*(-t) <--> X*(ω),

the impulse response of the optimum filter is given by:

(Eqn. 3) h_opt(t) - Kx*(T-t),

where h_opt(t) defines the time-domain matched compression filter coefficients, T is the "symbol interval", and x*(T-t) is the conjugate of a shifted mirror-image of the "symbol" x(t). The above relationships are applied to the excitation compression problem by considering the selected excitation template to be the symbol x(t). The symbol interval, T, is desirably the excitation template length. The time-domain matched compression filter coefficients, defined by hop)_t(t), are conveniently determined from Eqn. 3, thus ehminating the need for a frequency domain transformation (e.g.. Fast Fourier Transform) of the excitation template (as used with other methods). Constant K is desirably chosen to preserve overall energy characteristics of the filtered waveform relative to the original, and is desirably computed directly from the time-domain template. FIG. 4 illustrates an excitation template 90 and unsealed time-domain matched pulse compression filter coefficients 92 derived in accordance with a preferred embodiment of the present invention. As is shown in FIG. 4, the unsealed time-domain matched pulse compression filter coefficients 92 are a reversed representation of excitation template 90. The Determine Matched Compression Filter Coefficients step 64 provides a simple time-domain excitation pulse compression filter design method that eliminates computationally expensive Fourier Transform operations associated with other group delay removal techniques.

Referring back to FIG. 3, in an alternate embodiment, the Determine All -Pass Compression Filter Coefficients process 66 is performed. The Determine All-Pass

Compression Filter Coefficients process 66 minimizes time-domain waveform distortion by preserving the spectral magnitude characteristics of the excitation template and adjacent excitation while removing undesired dispersion or group delay characteristics.

FIG. 5 illustrates a flow chart of the Determine All-Pass Compression Filter Coefficients process 66 (FIG. 3) in accordance with an alternate embodiment of the present invention.

The Determine All-Pass Compression Filter Coefficients process begins in step 100 with the Estimate Group Delay step 102. The Estimate Group Delay step 102 first performs a time-domain to frequency-domain transformation on the excitation template from the Select Excitation Template step 62 (FIG. 3), resulting in an original group delay waveform.

The Estimate Group Delay step 102 then either directly samples the original group delay waveform or samples a filtered representation of the original group delay waveform to generate a decimated group delay waveform. The decimated group delay waveform is then interpolated in a linear or non-linear fashion to produce a group delay envelope estimate which approximates the original group delay envelope. As would be obvious to one of skill in the art based on the description, modulo-2Pi dealiasing is desirably performed prior to filtering and decimation, such methods being well known to those of skill in the art. After the Estimate Group Delay step 102, the Estimate Negative Group Delay step

104 generates a negative group delay envelope estimate from the original group delay waveform, or group delay envelope estimate. Next, the Compute All -Pass Compression Filter Coefficients step 106 utilizes a unit spectral magnitude and the negative group delay envelope estimate to derive the taps for an excitation pulse compression filter. Excitation pulse compression filter coefficients may be derived from the spectral information using methods well known to those of skill in the art.

The procedure then exits in step 108.

Referring back to FIG. 3, in another alternate embodiment, the Determine Whitening Compression Filter Coefficients process 68 is performed. The Determine Whitening Compression Filter Coefficients process 68 utilizes pulse compression filter coefficients to "whiten" the spectral components and remove the group delay of the excitation impulses in the proximity of the selected excitation template. "Whitening" the spectral components implies flattening the spectral magnitude of the excitation impulses using the pulse compression filter. This technique works well for excitation waveforms whose spectral characteristics in the proximity of the template are highly correlated with the spectral characteristic of the excitation template.

FIG. 6 illustrates a flow chart of the Determine Whitening Compression Filter Coefficients process 68 (FIG. 3) in accordance with another alternate embodiment of the present invention. The Determine Whitening Compression Filter Coefficients process begins in step 120 with the Estimate Magnitude and Group Delay step 122 which performs a time-domain to frequency-domain transformation on the excitation template from the Select Excitation Template step 62 (FIG. 3), resulting in a spectral magnitude waveform and a group delay waveform.

The Estimate Magnitude and Group Delay step 122 then either directly samples the original spectral magnitude and original group delay waveforms or samples filtered representations of the original spectral magnitude and original group delay waveforms to produce decimated waveforms. As would be obvious to one of skill in the art based on the description, modulo- 2Pi dealiasing is desirably performed prior to filtering and decimation, such methods being well known to those of skill in the art. The decimated waveforms are then interpolated in a linear or non-linear fashion to produce a magnitude envelope estimate and a group delay envelope estimate which approximate the original spectral magnitude and original group delay envelopes.

After the Estimate Magnitude and Group Delay step 122, the Estimate Inverse Magnitude and Negative Group Delay step 124 generates an inverse magnitude estimate and negative group delay estimate of the spectral data from the original magnitude and phase samples or the magnitude and phase estimates.

Next, the Estimate Whitening Compression Filter Coefficients step 126 utilizes the inverse spectral magnitude and negative group delay estimates to derive the taps for an excitation pulse compression filter. Excitation pulse compression filter coefficients may be derived from the spectral information using methods well known to those of skill in the art.

The procedure then exits in step 128. Referring again to FIG. 3, after either the Determine Matched Compression Filter

Coefficients step 64, the Determine All-Pass Compression Filter Coefficients process 66 or the Determine Whitening Compression Filter Coefficients process 68, the Pulse Compress Excitation step 70 filters the excitation waveform using a filter derived from the compression filter coefficients, resulting in a pulse compressed excitation waveform. The Pulse Compress Excitation step 70 reduces the excitation waveform to a pulse compressed excitation waveform which embodies both a periodic train of near- symmetrical impulses and low-level nonperiodic components.

As would be obvious to one of skill in the art based on the description herein, the Pulse Compress Excitation step 70 may also include interpolation of pulse compression filter coefficients, weight-overlap-add operations, and filter delay removal in order to ensure pulse compressed excitation waveform continuity. These general techniques are well known to those of skill in the art.

FIG. 7 shows an example of an excitation waveform 140 and a compressed excitation waveform 142 derived in accordance with a preferred embodiment of the present invention. Excitation waveform 140 represents an original, uncompressed excitation waveform that was derived from input speech samples chosen in accordance with the Select Input Analysis Block step 42 (FIG. 2). Pulse-compressed excitation waveform 142 represents the waveform derived from excitation waveform 140 in accordance with the present invention. Referring again to FIG. 3, the Blockwise Energy Normalization step 72 is then desirably performed, including energy normalization to match the energy envelope of the original excitation waveform. The Blockwise Energy Normalization step 72 measures the energy of the original excitation waveform and the corresponding pulse compressed excitation waveform. Then the pulse compressed excitation waveform is multiplied by the ratio of the measured energies to normalize the energy of the pulse compressed excitation waveform. This step may be omitted if normalization has been included in factor K of Eqn. 3. As would be obvious to one of skill in the art based on the description herein, the energy of the original excitation waveform may be measured prior to execution of the Blockwise Energy Normalization step 72. Next, the Estimate Excitation Parameters step 74 desirably evaluates the pulse compressed excitation waveform to estimate parameters such as excitation epoch positions, pitch period, and voice onset and offset locations using techniques that are commonly known to those of skill in the art. The procedure then exits in step 76.

As would be obvious to one of skill in the art based on the description herein, the Compress Excitation Pulses process 46 (FIG. 2) described above may also be implemented directly upon the speech waveform for the purposes of parameter extraction and refinement.

2. Closed-Loop Excitation Target Selection

A number of low-rate voice coding applications implement interpolation methods in order to reconstruct excitation waveform estimates. Significant deviation of the interpolated waveform energy relative to the original waveform can lead to audible artifacts in the synthesized speech. These types of interpolation errors can be caused by non-optimal excitation target selection. When the selection of interpolation target is made independently with respect to original excitation waveform characteristics and interpolation outcomes, significant interpolation error can result. Such interpolated excitation can display significant variations in overall waveform energy with respect to the original residual excitation waveform. What is needed is a more intelligent basis for target selection.

The Closed-Loop Excitation Target Selection process 48 (FIG. 2) overcomes such interpolation error by using a non-random, deterministic method for target selection. Reduced-bandwidth voice coding applications can utilize the Closed-Loop Excitation Target Selection process 48 (FIG. 2) as a way of improved modeling and reconstruction of the LPC-derived speech excitation waveform. These applications can benefit from closed-loop techniques that minimize the selected error measure between the actual and interpolated excitation waveforms. The Closed-Loop Excitation Target Selection process 48 (FIG. 2) reduces the nonrepresentative waveform variations that can result from the interpolation process, thus effectively smoothing the perceived character of the reconstructed speech.

FIG. 8 illustrates a flow chart of a method for closed-loop excitation target selection process 48 (FIG. 2) in accordance with a preferred embodiment of the present invention. The Closed-Loop Excitation Target Selection process begins in step 150 with the Select Error Measure step 152. The Select Error Measure step 152 selects an appropriate error measure (e.g., inner-product angle, linear correlation coefficient, Euclidean distance or less computationally intensive measures such as interpolation boundary or interpolation energy estimates). The Select Error Measure step 152 then computes the required parameters that relate to the selected error measure. In one embodiment of the invention, the Select Error Measure step 152 uses an amplitude boundary error measure and estimates the original excitation waveform amplitude boundary of an input analysis block for comparison against possible interpolation outcomes.

Next, the Select Candidate Target step 154 selects a candidate excitation target subframe within the input analysis block for analysis. For example, the candidate excitation target subframe may be selected as a first subframe of the analysis block that exceeds some minimum distance from an interpolation reference. During subsequent iterations of the Select Candidate Target step 154, the candidate excitation target subframe may be selected as a next adjoining subframe of the analysis block. The Estimate Interpolation Parameters step 156 uses the candidate target subframe and a source subframe (desirably a prior optimal target subframe) to estimate ensemble interpolation parameters that are pertinent to the chosen error measure.

After the Estimate Interpolation Parameters step 156, the Compute Error step 158 uses the relevant original excitation waveform parameters and the estimated companion interpolation parameters to compute the chosen error measure. The chosen error measure is desirably stored in memory.

A determination is then made in step 160 whether all appropriate candidate target subframes have been analyzed and an error corresponding to each candidate target subframe has been computed. If not, the Select Candidate Target step 154 is performed again, and the procedure iterates as shown in FIG. 8. If all errors have been computed in step 160, then the Select Optimal Excitation Target step 162 is performed.

The Select Optimal Excitation Target step 162 compares the stored error measurements and selects the optimal excitation target that minimizes the error measurement given interpolation. Instead of selecting interpolation targets randomly or based upon their position in the frame, the Select Optimal Excitation Target step 162 examines the overall interpolation error on a dynamic basis and makes decisions accordingly. In this manner, the candidate target which produces the lowest waveform error relative to the original excitation waveform is chosen as the optimal excitation target. The Select Optimal Excitation Target step 162 produces the optimal excitation target and, in order to maintain a low bit rate, a coarse approximation of the location of the optimal excitation target within the analysis block. As is obvious to one of skill in the art based on the description herein, the encoded coarse positioning information and encoded target are desirably transmitted to the synthesis device in order to preserve major excitation waveform features in the face of interpolation. The Closed-Loop Excitation Target Selection process then exits in step 164.

FIG. 9 illustrates exemplary waveforms derived using the closed-loop target selection method in accordance with a preferred embodiment of the present invention. Target epoch waveform 66 illustrates targets selected from varying locations within each frame, wherein each frame is shown separated by vertical dashed lines. The target epochs illustrated in target epoch waveform 66 have been selected to minimize a selected interpolation error measure. Synthesized excitation 68 is the result of interpolation between the selected target epochs of target epoch waveform 66. Synthesized excitation 68 may be compared to the original excitation 67.

3. Adaptive Excitation Weighting

Due to the asymmetrical, time-varying nature of the pitch-synchronous excitation target, commonly-used static weighting functions are not appropriate for pitch- synchronous modeling. Such fixed-envelope weighting functions can distort the fundamental excitation epoch characteristics that are necessary for high-quality reproduction of speech. In order to accommodate the time-var ing envelope characteristics of the synchronous excitation epochs, what is needed is an adaptive, time-varying weighting function which tailors its output toward the waveform characteristics of the excitation epoch being considered. The Adaptive Excitation Weighting process 50 (FIG. 2) provides a method of deriving a weighting function that is appropriate for pitch - synchronous modeling. The Adaptive Excitation Weighting process 50 (FIG. 2) benefits voice coding applications that utilize time and frequency domain excitation characterization techniques.

FIG. 10 illustrates a flow chart of a method for adaptively weighting an excitation waveform (step 50, FIG. 2) in accordance with a preferred embodiment of the present invention. The Adaptive Excitation Weighting process begins in step 170 by performing the Select Characterization Methodology step 172. The Select Characterization Methodology step 172 may choose between, for example:

• a characterization methodology that preserves primary or major excitation components at the expense of lower-level secondary components; or • a characterization methodology that preserves the secondary excitation components for characterization.

The above examples illustrate excitation characterization methodologies, however, other characterization methodologies may also be appropriate.

After the Select Characterization Methodology step 172, the Extract Excitation Portion step 174 extracts an appropriate excitation portion (e.g., a pitch synchronous portion or a single epoch) from the excitation waveform for application of the adaptive weighting function. Next, the Determine Excitation Features step 176 identifies the locations and relative amplitudes of primary and secondary components in the excitation portion. For example, to deterrnine the location of the primary component, the Determine Excitation Features step 176 could search for the excitation sample having the maximum absolute amplitude within the excitation portion. The excitation sample having the maximum amplitude is desirably classified as the location of the primary excitation component, and may be indicated by an index into the excitation portion.

The secondary component location may be determined, for example, by first dividing the excitation portion into segments and computing an energy measurement for each segment. The segment or segments that exceed a predetermined energy threshold relative to the primary segment energy are desirably classified as the locations of secondary components. The secondary component locations may be indicated by one or more indices into the excitation portion.

Next, the Create Adaptive Weighting Function step 178 uses the excitation portion length (in number of samples), primary and secondary component locations, and characterization methodology from the Select Characterization Methodology step 172 to create an adaptive weighting function that is adapted to the chosen excitation portion. The excitation portion length desirably defines the adaptive weighting function length, and consequently, the boundaries corresponding to the maximum adaptive weighting attenuation. The boundaries that indicate maximum attenuation of the adaptive weighting function may also be defined to reside within the excitation portion boundaries.

The locations of the primary and secondary components, in conjunction with the chosen characterization methodology, are used to determine the attenuation characteristics of the adaptive weighting function. Given the locations of the primary and secondary components in the corresponding excitation portion, a unit-amplitude variable-width rectangular window is desirably positioned relative to the locations of the primary excitation component and the secondary excitation component. The rectangular window offset may be made relative to the center of the adaptive weighting function length. The width of the rectangular window may vary between one sample and the adaptive weighting function length.

The offset, in conjunction with the rectangular window width, is selected so as to preserve the desired excitation components which are defined by the chosen characterization methodology. Two desirably sinusoidal attenuation characteristics are computed, one for application toward the excitation segment left of the offset variable- width rectangular window, and the other for application toward the excitation segment right of the offset variable-width rectangular window. These sinusoidal functions attenuate in a continuous fashion the undesired excitation components outside of the excitation preserved by the offset unit-amplitude rectangular window. The number of samples attenuated by the sinusoidal attenuation is determined by the chosen characterization methodology. FIG. 11 shows an example of an adaptive weighting function 190 derived in accordance with a preferred embodiment of the present invention. For example, a chosen characterization methodology may chiefly preserve the primary components at the expense of secondary components. The unit-amplitude window width, offset positioning and companion sinusoidal attenuation characteristics may be calculated so as to preserve only those primary components that are accurately represented by the characterization methods being employed. This approach serves to prevent possible aliasing problems arising from the presence of inaccurately characterized secondary excitation.

Referring again to FIG. 10, the Apply Adaptive Weighting Function step 180 applies the adaptive weighting function to the chosen excitation portion, resulting in a weighted excitation portion. FIG. 12 shows an example of an adaptive weighting function 200 derived in accordance with a preferred embodiment of the present invention as it relates to the excitation portion 202 from which it was derived. FIG. 13 shows an example of an adaptive weighting function 210 derived in accordance with a preferred embodiment of the present invention that would preserve more existing secondary excitation in a given excitation segment. Referring again to FIG. 10, the Adaptive Excitation Weighting process then exits in step 182.

Systems that use spectral pitch-synchronous excitation models can utilize the Adaptive Excitation Weighting process 50 (FIG. 2) to preserve the desired excitation components for characterization while providing a pre-Fast Fourier Transform window. The Adaptive Excitation Weighting process 50 (FIG. 2) provides the further benefit of minimizing epoch-to-epoch interpolation discontinuities which can arise in interpolating voice-coding schemes. Given the fundamental pitch of the excitation waveform, and the location of the major and secondary excitation features, the Adaptive Excitation Weighting process 50 (FIG. 2) will adjust its envelope to match the desired features of the underlying excitation.

A notable feature of the spectral excitation representation is the presence of modulating pseudo-sinusoidal components imposed upon the spectral envelope. These pseudo-sinusoidal components are due in part to the presence of complicating lower-level secondary components. Proper application of the Adaptive Excitation Weighting process 50 (FIG. 2) results in a multiplication/convolution duality, which pre-filters some of the modulating pseudo-sinusoidal components from the excitation frequency-domain envelope. Reduction of these pseudo-sinusoidal components enables any characterizing sampling process to better represent the overall envelope of the original spectral waveform. Depending upon the envelope characteristics of the excitation waveform, and the spectral characterization method being implemented, the Adaptive Excitation Weighting process 50 (FIG. 2) can adjust the applied envelope so as to maintain more or less of any existing secondary components. Hence, the Adaptive Excitation Weighting process 50 (FIG. 2) has been proven to reduce aliasing that can arise as a result of insufficient frequency-domain characterization of secondary excitation components.

In summary, this invention provides a method of encoding excitation waveform parameters for digital transmission that improves upon prior-art excitation parameterization techniques. Vocal excitation models implemented in most reduced- bandwidth vocoder technologies fail to reproduce the full character and resonance of the original speech, and are thus unacceptable for systems requiring high-quality voice communications.

The novel method is applicable for implementation in a variety of new and existing voice coding platforms that require more efficient and accurate excitation parameterization algorithms. Military voice coding applications and commercial demand for high-capacity telecommunications indicate a growing requirement for speech coding techniques that require less bandwidth while maintaining high levels of speech quality. The method of the present invention responds to these demands by facilitating high- quality speech analysis and synthesis at the lowest possible bit rates.

Thus, a method and apparatus for parameterization of speech excitation waveforms has been described which overcomes specific problems and accomplishes certain advantages relative to prior-art methods and mechanisms. The improvements over known technology are significant. Voice quality at low bit rates is enhanced. While a preferred embodiment has been described in terms of a telecommunications system and method, those of skill in the art will understand based on the description that the apparatus and method of the present invention are not limited to communications networks, but apply equally well to other types of systems where compression of voice or other signals is desired. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Accordingly, the invention is intended to embrace all such alternatives, modifications, equivalents and variations as fall within the spirit and broad scope of the appended claims.

Claims

CLAIMSWhat is claimed is:

1. A method of encoding excitation waveform parameters comprising the steps of: a) selecting an input analysis block comprising a plurality of excitation waveform samples; b) deriving a pulse compressed excitation waveform by filtering a portion of the input analysis block; c) estimating excitation parameters from the pulse compressed excitation waveform; d) selecting an excitation target by performing a closed-loop excitation target selection process that utilizes the excitation parameters; e) generating a weighted excitation portion by performing an adaptive excitation weighting process that creates an adaptive weighting function and applies the adaptive weighting function to the excitation target; and f) storing a bitstream derived from the weighted excitation portion.

2. A method of encoding excitation waveform parameters comprising the steps of: a) selecting an input analysis block comprising a plurality of excitation waveform samples; b) deriving a pulse compressed excitation waveform by filtering a portion of the input analysis block; c) estimating excitation parameters from the pulse compressed excitation waveform; d) selecting an excitation target by performing a closed-loop excitation target selection process that utilizes the excitation parameters; e) generating a weighted excitation portion by performing an adaptive excitation weighting process that creates an adaptive weighting function and applies the adaptive weighting function to the excitation target; and f) storing a bitstream derived from the weighted excitation portion.

3. A method of encoding excitation waveform parameters comprising the steps of: a) selecting an excitation template from a plurality of excitation waveform samples; b) determining excitation pulse compression filter coefficients from the excitation template; c) generating a pulse compressed excitation waveform by pulse compressing the plurality of excitation waveform samples using the excitation pulse compression filter coefficients; d) estimating excitation parameters from the pulse compressed excitation waveform; e) encoding data representative of the excitation parameters; and f) storing a bitstream derived from the data.

4. The method as claimed in claim 3, wherein step b) comprises the step of determining the excitation pulse compression filter coefficients for a matched excitation pulse compression filter.

5. The method as claimed in claim 3, wherein step b) comprises determining the excitation pulse compression filter coefficients for an all-pass compression filter by performing the steps of: b 1 ) producing a group delay estimate from the excitation template; b2) estimating a negative group delay estimate from the group delay estimate; and b3) determining the excitation pulse compression filter coefficients from a unit spectral magnitude and the negative group delay estimate.

6. The method as claimed in claim 3, wherein step b) comprises determining the excitation pulse compression filter coefficients for a whitening excitation pulse compression filter by performing the steps of: bl ) producing a spectral magnitude estimate and a group delay estimate from the excitation template; b2) estimating an inverse spectral magnitude estimate from the spectral magnitude estimate and estimating a negative group delay estimate from the group delay estimate; and b3) determining the excitation pulse compression filter coefficients from the inverse spectral magnitude estimate and the negative group delay estimate.

7. A method of encoding excitation waveform parameters comprising the steps of: a) selecting an input analysis block comprising a plurality of excitation waveform samples having excitation parameters; b) selecting an error measure; c) selecting a candidate target subframe from the input analysis block; d) estimating interpolation parameters related to the error measure between a source subframe and the candidate target subframe; e) computing an error between the interpolation parameters and the excitation parameters; f) repeating steps (c) through (e) for multiple candidate target subframes; g) selecting an optimal excitation target subframe corresponding to a minimal error between the interpolation parameters and the excitation parameters; h) encoding data representative of the optimal excitation target subframe; and i) storing a bitstream derived from the data.

8. A method of encoding excitation waveform parameters, the method comprising the steps of: a) extracting an excitation portion from excitation waveform samples; b) determining locations of excitation features within the excitation portion; c) creating an adaptive weighting function corresponding to the locations and a preselected characterization methodology; d) applying the adaptive weighting function to the excitation portion, resulting in a weighted excitation portion; e) encoding data representative of the weighted excitation portion; and f) storing a bitstream derived from the data.

9. The method as claimed in claim 8, wherein the excitation features comprise a primary excitation component and a secondary excitation component and step b) comprises the steps of: bl ) determining a primary excitation component location from the excitation portion; and b2) determining a secondary excitation component location from the excitation portion.

10. The method as claimed in claim 8, wherein the excitation features comprise a primary excitation component and a secondary excitation component and step c) comprises the steps of: c 1 ) defining a length of the adaptive weighting function; c2) determining a width of a variable-width rectangular window based on the preselected characterization methodology; c3) positioning the variable-width rectangular window based on the preselected characterization methodology with an offset from a center within the adaptive weighting function relative to a first location of the primary excitation component and a second location of the secondary excitation component in corresponding excitation portions; c4) computing at least one sinusoidal attenuation function based on the preselected characterization methodology, the length, the width, and the offset; and c5) implementing the at least one sinusoidal attenuation function on at least one side of the variable-width rectangular window, resulting in the adaptive weighting function.