WO1995028699A1 - Differential-transform-coded excitation for speech and audio coding - Google Patents

Differential-transform-coded excitation for speech and audio coding

Info

Publication number
WO1995028699A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
quantization
sound signal
transform
spectral
Prior art date
Application number
PCT/CA1995/000216
Other languages
French (fr)
Inventor
Jean-Pierre Adoul
Claude Laflamme
Redwan Salami
Roch Lefebvre
Original Assignee
Universite De Sherbrooke
Priority date
Filing date
Publication date
Application filed by Universite De Sherbrooke filed Critical Universite De Sherbrooke
Priority to AU22509/95A priority Critical patent/AU2250995A/en
Publication of WO1995028699A1 publication Critical patent/WO1995028699A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032 - Quantisation or dequantisation of spectral components
    • G10L19/038 - Vector quantisation, e.g. TwinVQ audio

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method for coding speech is disclosed. This method, called DTCX (Differential-Transform-Coded Excitation), combines in a new way the best features of time-domain techniques such as CELP (Code-Excited Linear Prediction) with the best features of frequency-domain techniques such as TC (Transform Coding), while avoiding their respective drawbacks. The invention preserves the principle of error minimization in the perceptually-weighted speech (/audio) domain which is found in CELP, along with techniques such as linear filtering and pitch prediction; yet it circumvents the complexity of the CELP analysis-by-synthesis approach by quantizing the target signal directly. The invention also takes advantage of the efficient frequency-domain differential-quantization techniques typical of transform coding (TC), such as spectral decimation, flexible bit allocation, as well as numerous forms of stored or algebraic vector quantization techniques. In addition, it is the difference between the current and previous spectra which is quantized, resulting in enhanced performance, in particular for audio coding. Yet unlike TC, the invention is essentially free from the framing problems that plague block transforming of continuous processes.

Description

DIFFERENTIAL-TRANSFORM-CODED EXCITATION
FOR SPEECH AND AUDIO CODING
BACKGROUND OF THE INVENTION
1. Field of the invention:
The present invention relates to a new technique for digitally encoding and decoding, in particular but not exclusively, speech and audio signals, with a view to transmitting and synthesizing these speech and audio signals.
2. Brief description of the prior art:
Efficient digital speech coding techniques with good subjective quality/bit-rate tradeoffs are increasingly in demand for numerous applications. Recently, CELP [Schroeder, M.R. & B. Atal, "Code-Excited Linear Prediction (CELP): high-quality speech at very low bit rates", IEEE ICASSP 1985] and Algebraic CELP [Adoul, J.-P. & Laflamme, C., "Dynamic Codebook For Efficient Speech Coding Based on Algebraic Codes", WO 91/13432 published on September 5, 1991] techniques have been developed successfully for voice transmission at rates between 4 and 8 kbps for applications such as land mobile, digital radio, secure telephony, etc. However (unless block sizes are reduced to but a few samples), CELP becomes impractical above 8 kbps as codebook sizes and search times increase exponentially with bit rate.
Differences and similarities between the prior art and the present invention:
The present invention, called DTCX (Differential-Transform-Coded Excitation), retains several features of CELP but circumvents the complexity limitation (DTCX's complexity tends to decrease with bit rate).
Along with CELP, the invention belongs to the "excited linear prediction (LP)" techniques. In this class of techniques, the reconstructed (i.e. decoded) signal is obtained by exciting a slowly-varying linear prediction (LP) filter, also referred to as the "synthesis filter". Since the excitation is an LP-residual (i.e. whitened) signal, the signal-to-noise performance of this type of coder readily reaps the full benefit of the linear-prediction gain.
TCX (Transform-Coded Excitation) and CELP use, however, opposite search strategies to achieve these common goals. To understand the fundamental difference, it is best to refer to the so-called "backward filtering" formulation of CELP [Adoul, J.-P.; Mabilleau, Ph.; Delprat, M.; and Morissette, S.; "Fast CELP Coding Based on Algebraic Code", IEEE ICASSP 1987]. In this formulation, a "target signal" is computed. Simply put, the coding problem is to find the winner, that is, the particular innovative component which, once LP-synthesized, will be the closest (in the mean-squared-error sense) to this appropriately named target signal.
The way CELP solves the problem is called "analysis-by-synthesis". In this approach, each possible innovative component (i.e. codebook entry) is LP-filtered one by one to yield the winner.
By contrast, and this is the crux, the invention takes the reverse path. Namely, the target signal itself is differentially quantized, and the winning innovation component is reached by (single) inverse LP-filtering of this quantized target.
There is still another very useful feature, so far left unmentioned, that the invention shares with CELP and which constitutes a distinct improvement over older "excited linear prediction (LP)" techniques such as RELP (Residual Excited LP). This has to do with the question of properly chaining up the reconstructed output blocks in the context of block processing. The target signals are not sample blocks taken out of a continuous process, but are so-called "decontextualized" signals (free of edge, or "ringing", considerations).
The fact that the invention is based on the differential quantization of a target signal offers the distinct possibility of taking advantage of efficient frequency-domain quantization techniques typical of Transform Coding (TC) while staying essentially free from the framing problems that plague block transforming of continuous processes. As a matter of fact, statistically invariant properties of speech (/audio) are often more readily usable in the frequency domain. This fact enables many efficient coding techniques including spectral decimation, flexible bit allocation, as well as numerous forms of stored or algebraic vector quantization techniques.
OBJECTS OF THE INVENTION
The main object of the invention is to formulate a general speech/audio-coding framework which combines in a new way the advantages of both the most efficient time-domain and frequency-domain analysis and encoding methods.
It is also an object of the invention to describe typical examples of perceptually-meaningful differential quantization procedures in the frequency domain to be used within the said general framework.
A further object of the invention is to provide an "excited linear prediction (LP) " technique using short-term (and, possibly, long-term) prediction analysis to obtain a residual (i.e. whitened) signal to which a series of perceptual and frequency transformations are applied in order to perform both a perceptually-meaningful and efficient differential quantization procedure in the frequency domain.
SUMMARY OF THE INVENTION
More specifically, in accordance with the present invention, there is provided a method of coding a sound signal to produce an index signal to be decoded into an excitation signal to be supplied to a synthesis filter to synthesize the sound signal, comprising the steps of: converting the sound signal into a frequency-domain signal by means of a given frequency transform; subtracting a previous frequency-domain signal produced by the converting step, from a current frequency-domain signal produced by this converting step to generate a difference signal; and conducting a spectral quantization on the difference signal to produce the index signal.
Also in accordance with the present invention, there is provided a device for coding a sound signal to produce an index signal to be decoded into an excitation signal to be supplied to a synthesis filter to synthesize this sound signal, comprising: first means for converting the sound signal into a frequency-domain signal by means of a given frequency transform; second means for subtracting a previous frequency-domain signal produced by the converting means, from a current frequency-domain signal produced by these converting means to generate a difference signal; and third means for spectrally quantizing the difference signal to produce the index signal.
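To make the core of the method concrete, the following is a minimal sketch of one encode/decode cycle, assuming a plain DFT as the frequency transform and a small stored codebook of difference spectra; the function names, the codebook, and the distortion measure are illustrative assumptions, not the patent's prescribed implementation.

```python
import numpy as np

def dtcx_encode_frame(v, X_prev, codebook):
    """One coding cycle: transform, subtract previous spectrum, quantize.

    v: current target frame; X_prev: previous quantized spectrum held by
    the local decoder; codebook: (K, N) array of complex difference spectra.
    """
    X = np.fft.fft(v)                # any reversible transform would do
    D = X - X_prev                   # differential spectrum
    # Single pass over the codebook; no analysis-by-synthesis filtering.
    err = np.sum(np.abs(codebook - D) ** 2, axis=1)
    i = int(np.argmin(err))          # index signal sent to the decoder
    X_hat = X_prev + codebook[i]     # updated local-decoder spectrum
    return i, X_hat

def dtcx_decode_frame(i, X_prev, codebook):
    """Decoder mirror: add the quantized difference back, invert the transform."""
    X_hat = X_prev + codebook[i]
    v_hat = np.fft.ifft(X_hat).real  # quantized signal, fed to the synthesis filter
    return v_hat, X_hat
```

Both sides carry X_prev across frames, so only the spectral difference is ever quantized, which is the differential aspect described above.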
According to preferred embodiments of the invention:
- the difference signal is quantized using a weighted mean-squared error criterion;
- the sound signal is sampled and arranged into frames of N consecutive samples applied to the converting step, N being an integer;
- a pitch-correlated component based on a past excitation signal is produced and subtracted from the sound signal prior to spectral quantization;
- the sound signal is perceptually weighted through a filter means, or the difference signal is perceptually weighted through the spectral quantization, which is based on a weighted-distortion measure;
- a ringing component is produced and removed from the sound signal prior to spectral quantization, this ringing component being a current effect of quantization errors incurred in previous sample frames;
- spectral quantization comprises a decimation step; and
- spectral quantization comprises decomposing the difference signal into amplitude and phase components prior to quantization, quantizing the amplitude components through at least one stored or algebraic vector quantization technique, and quantizing the phase components with either a lattice or a trellis based on a weighted cosine distortion measure.
The objects, advantages and other features of the present invention will become more apparent upon reading of the following non-restrictive description of preferred embodiments thereof, given by way of example only with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the appended drawings:
Figure 1 is a schematic block diagram of a general speech/audio-coding framework in accordance with the present invention, describing the coder (Note that the coder incorporates a local decoder. Hence, the decoder structure is not repeated.);
Figure 2 provides details for a typical implementation of the pitch prediction module of Figure 1;
Figures 3 and 4 provide two alternate approaches for implementing the perceptually-weighted differential-transform quantization in accordance with the general speech/audio-coding framework introduced in the present invention;
Figure 5 shows details for quantizers of Figures 3 and 4; and
Figures 6 and 7 show alternate methods to remove the ringing component.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Figure 1 illustrates a schematic block diagram for the general speech (/audio) encoding framework in accordance with the present invention.
Before being encoded by the device of Figure 1, an analog input speech or audio signal is band-filtered and sampled at the Nyquist rate (e.g. 8 kHz for telephony and 16 kHz or more for wideband applications). The resulting signal comprises a train of samples of varying amplitudes represented by 12 to 16 bits of a digital code.
Transmission is based on the encoding of blocks of N consecutive samples referred to as frames (e.g., N = 48 samples). Within a frame, samples are numbered by an index n (e.g., n = 0, 1, ..., N-1).
Let s[n] be the input signal (i.e. sampled speech or audio) for the current frame. Let also ŝ[n] be the corresponding received signal, also called the synthesized signal as it is outputted by the synthesis filter 5. Before encoding the current input frame, the received signal ŝ[n] is known up to n = -1 (or, equivalently, up to n = N-1 of the previous frame). Furthermore, it is already known that previous encodings will add a fortuitous component, z[n], to the received signal during the current frame. This phenomenon is referred to as the "ringing component" in the CELP literature. Basically, quantization errors of previous frames are still causing some "ringing" at the output of the synthesis filter.
To take this phenomenon into account, z[n] is first removed from s[n] (see 100). The difference signal, s[n] - z[n], is filtered by an analysis filter 1 to produce a residual signal r[n]. The purpose of the analysis filter 1 is to whiten the residual signal. Let A(z) be the transfer function of analysis filter 1. It is changed from frame to frame to take into consideration the varying spectral content of the input signal. Typically, A(z) is an m-th order FIR (finite impulse response) filter whose m coefficients are obtained using the well-known autocorrelation method according to either a forward or backward approach. At frame k, the transfer function of the synthesis filter 5 is 1/Ak(z), that is, the exact inverse of the analysis filter Ak(z).
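By way of illustration, the autocorrelation method referred to above can be realized with the classical Levinson-Durbin recursion; the sketch below makes assumed choices (no windowing, scipy for the FIR filtering, an arbitrary order m) that the text leaves open.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_autocorrelation(x, m):
    """Coefficients [1, a1, ..., am] of A(z) obtained by the autocorrelation
    method, computed with the Levinson-Durbin recursion."""
    N = len(x)
    r = np.correlate(x, x, mode="full")[N - 1:N + m]  # lags 0..m
    a = np.zeros(m + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, m + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k            # prediction-error update
    return a

def whiten(x, a):
    """Residual r[n]: the frame filtered by the analysis filter A(z)."""
    return lfilter(a, [1.0], x)
```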
Next, a pitch-prediction component, p[n], is removed from the residual r[n] (see 101). The resulting difference signal, v[n] = r[n] - p[n], is inputted to the "Perceptually-weighted spectral-quantization" module 3, which produces the digital output i(k) transmitted to the decoder 102. From this digital output, the decoder will be able to retrieve the quantized version of the difference signal, v̂[n], by the inverse transformation 4. Adding back the pitch-prediction component p[n] to the signal v̂[n] (see 103) yields the quantized version of the residual, r̂[n], also called the "excitation" of the synthesis filter 5.
The pitch prediction component, p[n], is produced by a pitch prediction module 6 which is detailed in Figure 2. The pitch analysis is similar to that used in CELP/ACELP coders. However, this invention proposes a new variant which improves performance, particularly in the case of backward analysis/synthesis filter adaptation. In the prior art, the pitch lag T (typically T ∈ {20, 21, ..., 147} for telephone applications) and the prediction gain Gp which minimize the following expression are searched:
min over T and Gp:  Σn=0..N-1 ( r[n] - Gp·r̂[n-T] )²

The improvement introduced in this invention consists of considering the "corrected" excitation, r̃[n-T], instead of the traditional r̂[n-T]. As illustrated in Figure 2, the "corrected" excitation r̃[n-T] is obtained by supplying the signal ŝ[n] to a pitch delay buffer 61 to obtain a synthesized output ŝ[n-T], and filtering this synthesized output with the current analysis filter Ak(z) (see 62). The "corrected" excitation r̃[n-T] is then amplified (gain Gp, 63) to obtain the pitch prediction component p[n].
Note that ŝ[n-T] belongs to some previous frame, say frame k-j (i.e. j = 1, 2, ...). Therefore, it was synthesized from r̂[n-T] using 1/Ak-j(z), a filter possibly very different from the inverse of Ak(z) when speech undergoes rapid spectral changes.
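One possible reading of this search, in sketch form: for each candidate lag T the optimal gain Gp has a closed form, so it suffices to maximize the normalized correlation between r[n] and the corrected excitation r̃[n-T]. The buffer handling (in particular the repetition of lags shorter than the frame) is an assumption, not spelled out in the text.

```python
import numpy as np
from scipy.signal import lfilter

def pitch_search(r, s_hat_past, a_current, t_min=20, t_max=147):
    """r: current residual frame; s_hat_past: past synthesized samples
    (at least t_max of them) ending at n = -1; a_current: current A_k(z)."""
    N, L = len(r), len(s_hat_past)
    # "Corrected" excitation: past synthesis refiltered by the CURRENT A_k(z),
    # as in blocks 61-62 of Figure 2.
    r_tilde = lfilter(a_current, [1.0], s_hat_past)
    best = (-1.0, t_min, 0.0)
    for T in range(t_min, t_max + 1):
        seg = r_tilde[L - T:L - T + N]
        if len(seg) < N:              # lag shorter than frame: repeat the period
            seg = np.tile(r_tilde[L - T:], int(np.ceil(N / T)))[:N]
        num = float(np.dot(r, seg))
        den = float(np.dot(seg, seg)) + 1e-12
        # Maximizing num^2/den minimizes sum_n (r[n] - Gp*seg[n])^2 over Gp.
        if num * num / den > best[0]:
            best = (num * num / den, T, num / den)
    _, T_opt, Gp = best
    return T_opt, Gp                  # p[n] = Gp * r_tilde[n - T_opt]
```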
Figure 3 describes the perceptually-weighted spectral quantization module 3 of Figure 1. The ultimate purpose of this module is simply to quantize v[n] into v̂[n], in both the most efficient and the most subjectively meaningful way possible.
Spectral quantization (i.e. quantization performed in the frequency domain) is used for its efficiency. Among other advantages, it allows dimensionality reduction.
Turning now to the concern for subjectively meaningful quantization, it is well known in the CELP literature that minimizing the error in the so-called weighted speech (/audio) domain is subjectively meaningful; the widely used weighting filter is of the following form, where γ is a scalar constant typically between 0.7 and 0.9:
W(z) = A(z) / A(z/γ)
Unlike CELP, the present invention uses quantization. However, the quantization also seeks to minimize the (quantization) error in the weighted speech (/audio) domain.
In Figure 3, a filter F(z) 30 provides the needed weighting. As a matter of fact, by setting its transfer function to F(z) = 1/A(z/γ), this filter combines with the synthesis filter 5 (Figure 1) to create the desired global weighting: A(z)F(z) = W(z).
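Because A(z/γ) follows from A(z) simply by scaling the i-th coefficient by γ^i, both F(z) = 1/A(z/γ) and the global weighting W(z) = A(z)/A(z/γ) are direct to realize. A short sketch, with γ = 0.8 as an assumed value inside the 0.7-0.9 range quoted above:

```python
import numpy as np
from scipy.signal import lfilter

def bandwidth_expand(a, gamma):
    """Coefficients of A(z/γ): the i-th coefficient of A(z) times γ^i."""
    return a * gamma ** np.arange(len(a))

def weighting_filter_F(x, a, gamma=0.8):
    """F(z) = 1 / A(z/γ), the filter 30 of Figure 3."""
    return lfilter([1.0], bandwidth_expand(a, gamma), x)

def perceptual_weight_W(x, a, gamma=0.8):
    """W(z) = A(z) / A(z/γ), the global weighting A(z)F(z)."""
    return lfilter(a, bandwidth_expand(a, gamma), x)
```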
The filter F(z) 30 is followed by a transform such as the odd DFT (Discrete Fourier Transform) 31. Any (orthonormal) transformation can be used with varying measures of success; these include (but do not exhaust) the traditional DFT, cosine, Hadamard, Karhunen-Loève and SVD transforms. The transform output, X[j], is a spectral signal with frequency-domain index j. The transform output X'[j] from the previously received subframe is removed (see 33) from the transform output X[j], and the difference is quantized according to an MSE (mean square error) distortion (see block 32). The index, i = i(k), outputted by the quantizer 32 constitutes the digital codes at frame k. From this index, the decoder will retrieve the (best) quantization value X̂i[j], which will yield v̂[n] after applying successively the inverse transform and the inverse filtering (i.e. 1/F(z) with zero initial state) (see Figure 1).

Figure 4 describes an alternate approach for implementing the perceptually-weighted spectral-quantization module 3. In this approach, the (spectral) weighting is no longer applied through filtering; it is introduced instead in the distortion measure of the quantizer. Consequently, the difference signal, v[n], is directly applied to the frequency transform. The odd DFT 34 is used in Figure 4 for illustration purposes; again, any transformation can be used with varying measures of success. Finally, the transform output X'[j] from the previously received subframe is removed (see 35) from the transform output X[j] (a vector of N/2 complex components in the odd-DFT case), and the difference is quantized (see 36) using the following weighted mean-squared error criterion.
min Σj q²[j] | X[j] - X̂[j] |²
Where q[j] is the weight vector. Taking q[j] equal to the modulus of F(z) = 1/A(z/γ) evaluated at z = exp(i2πj/N) will result in a structure functionally equivalent to that of Figure 3.
Note that in the alternate approach of Figure 4, any weighting filter W(z) (i.e. deemed ideal) can be implemented by taking q[j] equal to the modulus of F(z) = W(z)/A(z), since W(z) no longer needs to be known at the receiver. Note further that q[j] can implement any spectral weighting based on current and past frames.

The spectral quantizer modules of Figures 3 and 4 (i.e. modules 32 or 36) can be implemented in various ways. In particular, they can make use of one, or a combination, of Vector Quantization (VQ) techniques including, but not limited to, the stored-VQ variety (e.g. gain/shape VQ, tree-structured VQ, multistage VQ, split VQ) and the algebraic variety (e.g. lattice quantization (Q), trellis-coded Q, permutation Q).
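A sketch of this second arrangement: an odd DFT producing N/2 complex bins for a real frame (matching the "vector of N/2 complex components" above), followed by weighted-MSE selection over a hypothetical codebook of difference spectra; the weight vector q would be derived from |F(z)| as just described.

```python
import numpy as np

def odd_dft(x):
    """Odd DFT: bins at frequencies 2π(j + 1/2)/N, j = 0..N/2-1. For a real
    frame these N/2 complex values carry all the information."""
    N = len(x)
    n = np.arange(N)
    j = np.arange(N // 2)
    return np.exp(-2j * np.pi * np.outer(j + 0.5, n) / N) @ x

def quantize_weighted(X, X_prev, q, codebook):
    """Pick index i minimizing sum_j q[j]^2 |D[j] - C_i[j]|^2 for the
    difference spectrum D = X - X_prev (block 36 of Figure 4)."""
    D = X - X_prev
    err = np.sum((q ** 2) * np.abs(codebook - D) ** 2, axis=1)
    return int(np.argmin(err))
```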
Figure 5 details one typical implementation for module 32 or 36. The difference between the (complex) spectral signal, X[j], and the received spectral signal X'[j] from the previous subframe is first computed. This difference is decimated, according to a rule specified by index i1, in module 50. The (dimensionally) reduced difference spectral signal is decomposed into amplitude 51 and phase 52 components prior to quantization. The amplitudes, |X[j]|, are then quantized by one or a combination of Vector Quantization techniques (module 53) of the stored or algebraic varieties. The phase components, φ[j], are quantized (module 54) with either a lattice (e.g. Gosset, Barnes-Wall, Leech ...) or a trellis, based on the following novel criterion called the weighted cosine distortion measure.
Max Σj q²[j] |X[j]| |X̂[j]| cos( φ[j] - φ̂[j] )
Where hats refer to the quantized values. The weighting vector q[j] should be omitted for the MSE quantizer 32 of Figure 3. Indexes i1, i2 and i3 are then multiplexed (module 55) to yield the (global) differential spectral quantizer index, i = i(k), at frame k.
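The amplitude/phase split of Figure 5 and the weighted cosine criterion can be sketched as follows; the amplitude codebook and the finite set of candidate phase vectors (standing in for lattice or trellis points) are hypothetical placeholders, and the decimation step is omitted.

```python
import numpy as np

def quantize_polar(D, q, amp_codebook, phase_candidates):
    """D: (reduced) difference spectrum; amp_codebook: (K2, M) magnitudes;
    phase_candidates: (K3, M) phase vectors in radians."""
    amp, phase = np.abs(D), np.angle(D)
    # Amplitudes: nearest codeword in plain MSE (modules 51 and 53).
    i2 = int(np.argmin(np.sum((amp_codebook - amp) ** 2, axis=1)))
    amp_hat = amp_codebook[i2]
    # Phases: maximize the weighted cosine measure (modules 52 and 54).
    score = np.sum((q ** 2) * amp * amp_hat *
                   np.cos(phase - phase_candidates), axis=1)
    i3 = int(np.argmax(score))
    D_hat = amp_hat * np.exp(1j * phase_candidates[i3])
    return i2, i3, D_hat
```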
As indicated in the foregoing description, DTCX (Differential-Transform-Coded Excitation) retains several features of CELP but circumvents the complexity limitation of CELP (DTCX's complexity tends to decrease with bit rate). CELP uses the approach called "analysis-by-synthesis", in which each possible innovative component (i.e. codebook entry) is LP-filtered one by one to yield the winner; by contrast, the present invention takes the reverse path, in which the target signal itself is differentially quantized and the winning innovation component is reached by (single) inverse LP-filtering of this quantized target. The differential quantization of a target signal offers the distinct possibility of taking advantage of efficient frequency-domain quantization techniques typical of Transform Coding (TC) while staying essentially free from the framing problems that plague block transforming of continuous processes. As a matter of fact, statistically invariant properties of speech (/audio) are often more readily usable in the frequency domain. This fact enables many efficient coding techniques including spectral decimation, flexible bit allocation, as well as numerous forms of stored or algebraic vector quantization techniques.
Figures 6 and 7 describe two alternate methods for removing the ringing component, among variants of the method used in Figures 1 and 2. The ringing computation is based on the discrepancy between quantized and unquantized signals (i.e. v[n] - v̂[n], or r[n] - r̂[n], or s[n] - ŝ[n]), and the proper ringing can be removed from the residual as in Figure 6 (or added to p[n]). If the weighting filter F(z) 30 of Figure 3 is implemented, an elegant solution is to remove the proper ringing from the initial filter state, as illustrated in Figure 7.
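In either variant, the ringing component itself is just the zero-input response of the filter in question, i.e. the filter run on zeros from the state left over by previous frames. A minimal sketch, assuming the synthesis filter 1/A(z) and a scipy state vector of length len(a) - 1 carried across frames:

```python
import numpy as np
from scipy.signal import lfilter

def ringing_component(a, state, N):
    """Zero-input response z[n] of 1/A(z) over one frame of N samples,
    given the internal filter state left by previously decoded frames;
    z[n] is then subtracted from s[n] as in block 100 of Figure 1."""
    z, _ = lfilter([1.0], a, np.zeros(N), zi=state)
    return z
```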
Although the present invention has been described hereinabove by way of a preferred embodiment thereof, this embodiment can be modified at will, within the scope of the appended claims, without departing from the spirit and nature of the subject invention.

Claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:
1. A method of coding a sound signal to produce an index signal to be decoded into an excitation signal to be supplied to a synthesis filter to synthesize said sound signal, comprising the steps of: converting said sound signal into a frequency-domain signal by means of a given frequency transform; subtracting a previous frequency-domain signal produced by the converting step, from a current frequency-domain signal produced by said converting step to generate a difference signal; and conducting a spectral quantization on said difference signal to produce the index signal.
2. The method of claim 1, wherein said difference signal is quantized using a weighted mean-squared error criterion.
3. The method of claim 1, further comprising the step of sampling the sound signal and arranging said sampled sound signal into frames of N consecutive samples applied to said converting step, N being an integer.
4. The method of claim 1, further comprising the step of producing a pitch-correlated component based on a past excitation signal, and the step of subtracting said pitch-correlated component from said sound signal prior to said spectral quantization.
5. The method of claim 1, further comprising the step of perceptually weighting said sound signal through a filter means.
6. The method of claim 1, further comprising the step of perceptually weighting said difference signal through said spectral quantization which is based on a weighted-distortion measure.
7. The method of claim 6, comprising the steps of sampling said sound signal and arranging said sampled sound signal into frames of N consecutive samples, N being an integer, wherein said weighted-distortion measure uses a weighting filter at a current frame.
8. The method of claim 6, comprising the steps of sampling said sound signal and arranging said sampled sound signal into frames of N consecutive samples, N being an integer, wherein said weighted-distortion measure implements a spectral weighting based on current and past frames.
9. The method of claim 1, further comprising the steps of: sampling said sound signal; arranging said sampled sound signal into frames of N consecutive samples, N being an integer; and producing a ringing component and removing said ringing component from said sound signal prior to said spectral quantization, said ringing component being a current effect of quantization errors incurred in previous sample frames.
10. The method of claim 9, further comprising the step of perceptually weighting said sound signal through a filter, wherein said ringing component is removed by modifying an initial state of said filter.
11. The method of claim 1, in which said spectral quantization uses a reversible frame transform.
12. The method of claim 11, in which said reversible frame transform is selected from the group consisting of discrete Fourier transform, odd discrete Fourier transform, cosine transform, Karhunen-Loeve transform, and SVD transform.
13. The method of claim 1, in which said spectral quantization comprises a decimation step.
14. The method of claim 1, in which said step of conducting a spectral quantization comprises using one, or a combination of vector quantization techniques.
15. The method of claim 14, in which said quantization techniques are selected from the group consisting of gain/shape vector quantization, tree-structured vector quantization, multistage vector quantization, split vector quantization, lattice quantization, trellis-coded quantization, and permutation quantization.
16. The method of claim 1, in which said step of conducting a spectral quantization comprises: decomposing said difference signal into amplitude and phase components prior to quantization; quantizing the amplitude components through at least one stored or algebraic vector quantization technique; and quantizing the phase components with either a lattice or a trellis based on a weighted cosine distortion measure.
17. The method of claim 4, comprising improving the pitch-correlated component by considering a refined version of the past excitation signal which reflects a current transfer function of the synthesis filter, said refined version of the excitation signal being obtained by inverse filtering the past synthesized signal.
18. A device for coding a sound signal to produce an index signal to be decoded into an excitation signal to be supplied to a synthesis filter to synthesize said sound signal, comprising: first means for converting said sound signal into a frequency-domain signal by means of a given frequency transform; second means for subtracting a previous frequency-domain signal produced by the converting means, from a current frequency-domain signal produced by said converting means to generate a difference signal; and third means for spectrally quantizing said difference signal to produce the index signal.
19. A device as recited in claim 18, wherein said third means comprises means for quantizing said difference signal using a weighted mean-squared error criterion.
20. A device as recited in claim 18, further comprising means for sampling the sound signal and means for arranging said sampled sound signal into frames of N consecutive samples applied to said converting means, N being an integer.
21. A device as recited in claim 18, further comprising means for producing a pitch-correlated component based on a past excitation signal, and means for subtracting said pitch-correlated component from said sound signal prior to said spectral quantization.
22. A device as recited in claim 18, further comprising filter means for perceptually weighting said sound signal.
23. A device as recited in claim 18, further comprising means for perceptually weighting said difference signal through said spectral quantization which is based on a weighted-distortion measure.
24. A device as recited in claim 18, further comprising: means for sampling said sound signal; means for arranging said sampled sound signal into frames of N consecutive samples, N being an integer; means for producing a ringing component which is a current effect of quantization errors incurred in previous sample frames; and means for removing said ringing component from said sound signal prior to said spectral quantization.
25. A device as recited in claim 24, further comprising filter means for perceptually weighting said sound signal, wherein said ringing component removing means comprises means for modifying an initial state of said filter means to remove said ringing component.
26. A device as recited in claim 18, in which said third means comprises: means for decomposing said difference signal into amplitude and phase components prior to quantization; means for quantizing the amplitude components through at least one stored or algebraic vector quantization technique; and means for quantizing the phase components with either a lattice or a trellis based on a weighted cosine distortion measure.
27. A device as recited in claim 21, comprising means for improving the pitch-correlated component by considering a refined version of the past excitation signal which reflects a current transfer function of the synthesis filter, said refined version of the excitation signal being obtained by inverse filtering the past synthesized signal.
PCT/CA1995/000216 1994-04-19 1995-04-18 Differential-transform-coded excitation for speech and audio coding WO1995028699A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU22509/95A AU2250995A (en) 1994-04-19 1995-04-18 Differential-transform-coded excitation for speech and audio coding

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA 2121667 CA2121667A1 (en) 1994-04-19 1994-04-19 Differential-transform-coded excitation for speech and audio coding
CA2,121,667 1994-04-19

Publications (1)

Publication Number Publication Date
WO1995028699A1 true WO1995028699A1 (en) 1995-10-26

Family

ID=4153411

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA1995/000216 WO1995028699A1 (en) 1994-04-19 1995-04-18 Differential-transform-coded excitation for speech and audio coding

Country Status (3)

Country Link
AU (1) AU2250995A (en)
CA (1) CA2121667A1 (en)
WO (1) WO1995028699A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002093559A1 (en) * 2001-05-11 2002-11-21 Matsushita Electric Industrial Co., Ltd. Device to encode, decode and broadcast audio signal with reduced size spectral information
EP2077551A1 (en) * 2008-01-04 2009-07-08 Dolby Sweden AB Audio encoder and decoder
US7738558B2 (en) 2007-07-23 2010-06-15 Huawei Technologies Co., Ltd. Vector coding method and apparatus and computer program
CN101086845B (en) * 2006-06-08 2011-06-01 北京天籁传音数字技术有限公司 Sound coding device and method and sound decoding device and method
CN103366751A (en) * 2012-03-28 2013-10-23 北京天籁传音数字技术有限公司 Sound coding and decoding apparatus and sound coding and decoding method
US9224403B2 (en) 2010-07-02 2015-12-29 Dolby International Ab Selective bass post filter

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4907276A (en) * 1988-04-05 1990-03-06 The Dsp Group (Israel) Ltd. Fast search method for vector quantizer communication and pattern recognition systems
US5206884A (en) * 1990-10-25 1993-04-27 Comsat Transform domain quantization technique for adaptive predictive coding
EP0590155A1 (en) * 1992-03-18 1994-04-06 Sony Corporation High-efficiency encoding method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4907276A (en) * 1988-04-05 1990-03-06 The Dsp Group (Israel) Ltd. Fast search method for vector quantizer communication and pattern recognition systems
US5206884A (en) * 1990-10-25 1993-04-27 Comsat Transform domain quantization technique for adaptive predictive coding
EP0590155A1 (en) * 1992-03-18 1994-04-06 Sony Corporation High-efficiency encoding method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BHASKAR: "ADAPTIVE PREDICTIVE CODING WITH TRANSFORM DOMAIN QUANTIZATION", IN "SPEECH AND AUDIO CODING FOR WIRELESS AND NETWORK APPLICATIONS" BY ATAL, CUPERMAN AND GERSHO, BOSTON - DORDRECHT - LONDON, XP000470450 *
BOCHOW ET AL.: "MULTIPROCESSOR IMPLEMENTATION OF AN ATC AUDIO CODEC", INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING 89, vol. 3, 23 May 1989 (1989-05-23) - 26 May 1989 (1989-05-26), GLASGOW, GB, pages 1981 - 1984, XP000089270 *
LEFEBVRE ET AL.: "8 kbit/s coding of speech with 6 ms frame-length", INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING 93, 27 April 1993 (1993-04-27) - 30 April 1993 (1993-04-30), MINNEAPOLIS, MN, US, pages 612 - 615 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002093559A1 (en) * 2001-05-11 2002-11-21 Matsushita Electric Industrial Co., Ltd. Device to encode, decode and broadcast audio signal with reduced size spectral information
CN101086845B (en) * 2006-06-08 2011-06-01 北京天籁传音数字技术有限公司 Sound coding device and method and sound decoding device and method
US7738558B2 (en) 2007-07-23 2010-06-15 Huawei Technologies Co., Ltd. Vector coding method and apparatus and computer program
US7738559B2 (en) 2007-07-23 2010-06-15 Huawei Technologies Co., Ltd. Vector decoding method and apparatus and computer program
US7746932B2 (en) 2007-07-23 2010-06-29 Huawei Technologies Co., Ltd. Vector coding/decoding apparatus and stream media player
US8938387B2 (en) 2008-01-04 2015-01-20 Dolby Laboratories Licensing Corporation Audio encoder and decoder
EP2077551A1 (en) * 2008-01-04 2009-07-08 Dolby Sweden AB Audio encoder and decoder
WO2009086919A1 (en) * 2008-01-04 2009-07-16 Dolby Sweden Ab Audio encoder and decoder
US8484019B2 (en) 2008-01-04 2013-07-09 Dolby Laboratories Licensing Corporation Audio encoder and decoder
US8494863B2 (en) 2008-01-04 2013-07-23 Dolby Laboratories Licensing Corporation Audio encoder and decoder with long term prediction
US8924201B2 (en) 2008-01-04 2014-12-30 Dolby International Ab Audio encoder and decoder
US9558753B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Pitch filter for audio signals
US9858940B2 (en) 2010-07-02 2018-01-02 Dolby International Ab Pitch filter for audio signals
US9224403B2 (en) 2010-07-02 2015-12-29 Dolby International Ab Selective bass post filter
US9343077B2 (en) 2010-07-02 2016-05-17 Dolby International Ab Pitch filter for audio signals
US9396736B2 (en) 2010-07-02 2016-07-19 Dolby International Ab Audio encoder and decoder with multiple coding modes
US9552824B2 (en) 2010-07-02 2017-01-24 Dolby International Ab Post filter
US9558754B2 (en) 2010-07-02 2017-01-31 Dolby International Ab Audio encoder and decoder with pitch prediction
US11610595B2 (en) 2010-07-02 2023-03-21 Dolby International Ab Post filter for audio signals
US9595270B2 (en) 2010-07-02 2017-03-14 Dolby International Ab Selective post filter
US9830923B2 (en) 2010-07-02 2017-11-28 Dolby International Ab Selective bass post filter
US11183200B2 (en) 2010-07-02 2021-11-23 Dolby International Ab Post filter for audio signals
US10236010B2 (en) 2010-07-02 2019-03-19 Dolby International Ab Pitch filter for audio signals
US10811024B2 (en) 2010-07-02 2020-10-20 Dolby International Ab Post filter for audio signals
CN103366751B (en) * 2012-03-28 2015-10-14 北京天籁传音数字技术有限公司 A kind of sound codec devices and methods therefor
CN103366751A (en) * 2012-03-28 2013-10-23 北京天籁传音数字技术有限公司 Sound coding and decoding apparatus and sound coding and decoding method

Also Published As

Publication number Publication date
CA2121667A1 (en) 1995-10-20
AU2250995A (en) 1995-11-10

Similar Documents

Publication Publication Date Title
US4868867A (en) Vector excitation speech or audio coder for transmission or storage
JP4662673B2 (en) Gain smoothing in wideband speech and audio signal decoders.
EP0942411B1 (en) Audio signal coding and decoding apparatus
EP0910067B1 (en) Audio signal coding and decoding methods and audio signal coder and decoder
RU2327230C2 (en) Method and device for frquency-selective pitch extraction of synthetic speech
EP1262956B1 (en) Signal encoding method and apparatus
JP4567238B2 (en) Encoding method, decoding method, encoder, and decoder
US6081776A (en) Speech coding system and method including adaptive finite impulse response filter
US7260523B2 (en) Sub-band speech coding system
US6782359B2 (en) Determining linear predictive coding filter parameters for encoding a voice signal
USRE43099E1 (en) Speech coder methods and systems
CA1219079A (en) Multi-pulse type vocoder
JPH11510274A (en) Method and apparatus for generating and encoding line spectral square root
EP1513137A1 (en) Speech processing system and method with multi-pulse excitation
JPH10214100A (en) Voice synthesizing method
EP2559028B1 (en) Flexible and scalable combined innovation codebook for use in celp coder and decoder
WO2009125588A1 (en) Encoding device and encoding method
US20040153317A1 (en) 600 Bps mixed excitation linear prediction transcoding
US6269332B1 (en) Method of encoding a speech signal
EP0919989A1 (en) Audio signal encoder, audio signal decoder, and method for encoding and decoding audio signal
WO1995028699A1 (en) Differential-transform-coded excitation for speech and audio coding
WO2000057401A1 (en) Computation and quantization of voiced excitation pulse shapes in linear predictive coding of speech
Lefebvre et al. 8 kbit/s coding of speech with 6 ms frame-length
JP2000132193A (en) Signal encoding device and method therefor, and signal decoding device and method therefor
JPH10260698A (en) Signal encoding device

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AM AT AU BB BG BR BY CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LT LU LV MD MG MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA US UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE MW SD SZ UG AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase