WO2000057401A1

WO2000057401A1 - Computation and quantization of voiced excitation pulse shapes in linear predictive coding of speech

Info

Publication number: WO2000057401A1
Application number: PCT/CA2000/000287
Authority: WO
Inventors: Mohammad Aamir Husain; Bhaskar Bhattacharya
Original assignee: Glenayre Electronics, Inc.
Priority date: 1999-03-24
Filing date: 2000-03-15
Publication date: 2000-09-28
Also published as: AU3411000A

Abstract

The invention facilitates linear predictive coding of voiced speech sounds, with a single voiced excitation code book encompassing a wide range of different pitch periods. Unlike prior art techniques which represent voiced excitation using fixed pulse shapes with random variations added, the invention uses a single code book of representative frequency spectra to closely match the original unquantized pulse shape. In particular, the invention facilitates determination of a pulse shape vector v for a linear predictive speech coder from a voiced residue pulse u, during a sampling instant n characterized by a gain g and a pitch period p. A spectral magnitude vector Suq of dimension dsm is derived (38) to represent the frequency spectral magnitude of the pulse during the sampling instant. A code book Csm (46) containing a plurality of vectors representative of pre-selected spectral magnitude vectors is provided. A vector which provides a minimum error approximation to Suq is selected from the code book (40). ism is the spectral magnitude index, within the code book, of the selected minimum error approximation vector. A quantized spectral magnitude vector S having the spectral magnitude index ism and having dsm elements is then derived (42). A complex frequency spectrum signal X is derived from S and is converted to a complex time-domain representation x. The pulse shape vector v is then derived from the Real components of x.

Description

COMPUTATION AND QUANTIZATION OF VOICED

EXCITATION PULSE SHAPES IN LINEAR

PREDICTIVE CODING OF SPEECH

Technical Field

This invention is directed to linear predictive coding of voiced speech sounds. A single code book containing code words representative of different frequency spectra facilitates reconstruction of speech sounds, irrespective of pitch differences in such sounds.

Background

Linear Predictive Coding (LPC) of speech involves estimating the coefficients of a time varying filter (henceforth called a "synthesis filter") and providing appropriate excitation (input) to that time varying filter. The process is conventionally broken down in two steps known as encoding and decoding.

As shown in Figure 1, in the encoding step, the speech signal s is first filtered by pre-filter 10. The pre-filtered speech signal s_p is then analyzed by LPC Analysis block 14 to compute the coefficients of the synthesis filter. Then, an "analysis filter" 12 is formed, using the same coefficients as the synthesis filter but having an inverse structure. The pre-filtered speech signal s_p is processed by analysis filter 12 to produce an output signal u called the "residue". Information about the filter coefficients and the residue is passed to the decoder for use in the decoding step.

In the decoding step, a synthesis filter 18 is formed using the coefficients obtained from the encoder. An appropriate excitation signal e is applied to synthesis filter 18 by excitation generator 16, based on the information about the residue obtained from the encoder. Synthesis filter 18 outputs a synthetic speech signal y, which is ideally the closest possible approximation to the original speech signal s.

The present invention pertains to excitation generator 16 and to the way in which information about the residue passes from the encoder to the decoder. Analysis filter 12 and synthesis filter 18 are exact inverses of each other. Therefore, if the residue signal u were applied directly to synthesis filter 18, the decoder would exactly reproduce the pre-filtered speech signal s_p. In other words, if the precise residue signal u could be transferred from the encoder to the decoder, then the synthetic speech output signal y would be of very high quality (i.e. as good as the pre-filtered speech signal s_p). However, bandwidth restrictions necessitate quantization of the residue signal u, which unavoidably distorts the excitation signal e and the resultant synthetic speech signal y. Excitation generator 16 incorporates both a "voiced" excitation generator, and an "un-voiced" excitation generator. The quantization process exploits structural differences between voiced and unvoiced components of the residue. The voiced residue is quasi-periodic, while the unvoiced residue is like a randomly varying signal. The present invention deals particularly with quantization of the voiced residue, and corresponding generation of voiced excitation in the decoder.
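The inverse relationship between analysis filter 12 and synthesis filter 18 can be illustrated with a minimal Python sketch. It assumes fixed (not time-varying) predictor coefficients and made-up sample values; the function names, the coefficient list `a` and the sample list `s_p` are hypothetical and do not come from the disclosure.

```python
def analysis_filter(s, a):
    """All-zero analysis filter: u[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    u = []
    for n in range(len(s)):
        pred = sum(a[k] * s[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        u.append(s[n] - pred)
    return u


def synthesis_filter(e, a):
    """All-pole synthesis filter: y[n] = e[n] + sum_k a[k] * y[n-1-k]."""
    y = []
    for n in range(len(e)):
        pred = sum(a[k] * y[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        y.append(e[n] + pred)
    return y


a = [0.9, -0.2]                          # hypothetical predictor coefficients
s_p = [1.0, 0.5, -0.3, 0.8, 0.1, -0.6]   # hypothetical pre-filtered samples
u = analysis_filter(s_p, a)              # residue
y = synthesis_filter(u, a)               # excitation = unquantized residue
```

When the unquantized residue is used as the excitation, `y` matches `s_p` to within floating-point rounding, which is the ideal case described above. Any quantization of `u` breaks this equality.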

The voiced residue can be described in terms of three parameters for quantization purposes: pitch, p_u; gain, g; and, the shape of a single cycle, called the pulse shape. Pitch refers to the periodicity of the signal and is equal to the distance between subsequent pulses in the residue signal u. Gain refers to the energy of the signal and is higher for a residue having higher energy. The pulse shape is the actual geometric shape of each pulse (a single cycle) in the voiced residue. A typical voiced residue signal is shown in Figure 2.

Prior art LPC coding techniques have quantized pitch and gain parameters, but have achieved only poor representation of pulse shapes. For example, early LPC coders used single unit impulses to represent pulse shape (Markel, J.D. and Gray, A.H. Jr., "A Linear Prediction Vocoder Simulation Based Upon the Autocorrelation Method", IEEE Trans. ASSP, Vol. 22, 1974, pp. 124-134); the LPC-10 government standard (U.S. Government Federal Standard 1015, 1977) represented each pulse by a fixed shape; and more recently, excitation pulse shapes have been represented as a sum of a fixed shape and random noise (McCree, A.V. and Barnwell III, T.P., "A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding," IEEE Trans. on Speech and Audio Processing, Vol. 3, No. 4, July 1995, pp. 242-250). Pulse trains constructed from such restricted shapes provide a poor representation of the variations in pulse shapes observed in residual signals output by analysis filter 12, as is evident from the sample residue signal shown in Figure 2.

A common technique known in the art of speech coding is "vector quantization", in which a vector of samples (e.g. a signal segment) is represented as one of a predetermined set of vectors called "code words". All of the code words are assembled to form a table called a "code book". The difficulty in using a standard vector quantization approach is that the pulse shapes required to be represented in LPC based speech coding are not of fixed length, but vary with pitch period. In principle, one could construct a plurality of code books, one for each possible value of pitch period, but this approach requires too many code books. It is impractical in many cases to use multiple code books due to memory limitations of the hardware in which speech encoding and decoding capabilities are preferably provided. For example, large integrated circuit memory chips have relatively high power consumption requirements which cannot be satisfied in small battery powered systems such as voice pagers, which must remain active for months between battery replacements.

This invention provides improved representation of pulse shapes in LPC coding of voiced speech, irrespective of pitch period variations, and requires only a single code book. The dashed line shown in Figure 1 represents the transfer of information about the residue from analysis filter 12 to excitation generator 16. Figure 3 depicts this transfer in greater detail in respect of the aforementioned pitch, gain and pulse shape parameters. However, the present invention focuses only on transfer of an improved pulse shape parameter in LPC coding of voiced speech sounds.

Summary of Invention

The invention facilitates good quality LPC coding of voiced speech sounds through better quantization of excitation pulse shapes for all possible pitch periods. Unlike prior art techniques which use fixed shape excitation pulses, or excitation pulses formed by adding random noise to a fixed shape, the invention utilizes a novel frequency domain code book with code words representative of signal frequency spectra, to select a pulse shape that closely matches the original pulse shape from the residue signal. In particular, the invention provides a method of determining a pulse shape vector v for a linear predictive speech coder from a voiced residue pulse v_uq, during a sampling instant n characterized by a gain g and a pitch period p_u. A spectral magnitude vector S_uq of dimension d_sm is derived to represent the frequency spectral magnitude of the pulse during the sampling instant. A code book C_sm, containing a plurality of vectors representative of pre-selected spectral magnitude vectors, is provided. A vector which provides a minimum error approximation to S_uq is selected from the code book. i_sm is the spectral magnitude index, within the code book, of the selected minimum error approximation vector. A quantized spectral magnitude vector S having the spectral magnitude index i_sm and having d_sm elements is then derived. A complex frequency spectrum signal X is derived from the quantized spectral magnitude vector S and the quantized pitch period p. This in turn is converted to a complex time domain representation x. The pulse shape vector v is then derived from the Real components of x.

Brief Description of Drawings

Figure 1 is a block diagram representation of a prior art LPC based speech encoder/decoder.

Figure 2 depicts a typical voiced residue signal waveform and the shapes of individual pulses found in typical voiced residue/excitation signals.

Figure 3 is a block diagram representation of the information pathway over which information respecting the voiced residue is transferred from the encoder to the decoder in the preferred embodiment of the invention.

Figure 4 is a block diagram representation showing further details of the pulse shape encoder and pulse shape decoder blocks depicted in Figure 3.

Figure 5 graphically depicts interpolation of a harmonics vector, in accordance with the invention, to produce a spectral magnitude vector for cases in which the dimension of the harmonics vector is less than the desired dimension of the spectral magnitude vector.

Figure 6 graphically depicts decimation of a harmonics vector, in accordance with the invention, to produce a spectral magnitude vector for cases in which the dimension of the harmonics vector exceeds the desired dimension of the spectral magnitude vector.

Description

Introduction

As previously explained, the pre-filtered signal, s_p, (Figure 1) is obtained by passing the original speech signal, s, through a pre-processing filter 10. The residue, u, is obtained by passing the pre-filtered signal, s_p, through a time-varying all-zero LPC analysis filter 12. The coefficients applied to filter 12 are obtained by LPC analyzer 14 using techniques which are well known to persons skilled in the art and need not be described here.

If, at any desired sampling (time) instant, n, the original speech signal s is classified as voiced (using techniques which are well known in the art), then a pulse-shape vector v_uq is obtained as described below for that particular sampling instant. The energy at any sampling instant, n, is represented by a gain, g, corresponding to the root mean square value of the residue over a window (typically having a length of 80-160 samples) centred at the sampling instant, n. The pitch period at any sampling instant, n, as determined in the speech encoder, is denoted by p_u and the quantized pitch at the speech decoder is denoted by p.
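As a rough illustration of the gain computation described above, the following Python sketch computes the root mean square of the residue over a window centred at sampling instant n. The function name `rms_gain`, the default window length of 160 samples (the upper end of the stated 80-160 range) and the clipping of the window at the signal boundaries are assumptions made for this sketch only.

```python
import math


def rms_gain(u, n, window=160):
    """RMS of residue u over `window` samples centred at index n.

    Assumption: the window is clipped where it would run past either
    end of the signal.
    """
    half = window // 2
    lo = max(0, n - half)
    hi = min(len(u), n + half)
    segment = u[lo:hi]
    return math.sqrt(sum(x * x for x in segment) / len(segment))


# Usage with a hypothetical constant-amplitude residue
example_gain = rms_gain([3.0] * 200, 100)
```

The sketch returns a linear RMS value. If the gain is carried in decibels, it could be converted with 20*log10 of this value; that choice is likewise an assumption.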

More particularly, as seen in Figure 3, voicing and gain analyzer 20 receives original speech signal s and residue u, and outputs signals representative of pitch period p_u, gain g and pulse-shape vector v_uq respectively. On the encoder side, pitch encoder 24 processes pitch period p_u for further processing by pitch decoder 34 on the decoder side to yield quantized pitch p, which is in turn input to the decoder's voiced excitation generator 22. Pulse shape encoder 28 processes pulse-shape vector v_uq for further processing by pulse shape decoder 30 to yield pulse shape vector v for input to voiced excitation generator 22. Gain encoder 26 processes the gain characteristic of the signal output by voicing and gain analyzer 20 for further processing by gain decoders 32, 36 which respectively yield the gain g for input to voiced excitation generator 22 (on the decoder side) and pulse shape encoder 28 (on the encoder side). The operation of pulse shape encoder 28 and pulse shape decoder 30 will now be described in further detail, with reference to Figure 4.

Computation of Spectral Magnitude Vectors

A spectral magnitude vector, S_uq, is obtained (Figure 4, block 38) as follows. First, an unquantized time domain pulse shape vector, v_uq, is determined as:

$$v_{uq}(j) = \frac{u\left(n - \lfloor (p_u - 1)/2 \rfloor + j\right)}{10^{g/20}}$$

for j = 0, ..., p_u-1
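A minimal Python sketch of this pulse extraction step follows. It assumes that one pitch cycle of p_u samples is taken from the residue starting at n - floor((p_u - 1)/2), and that the gain g is expressed in decibels so that the normalizing factor is 10^(g/20). The function name `extract_pulse` and the parameter name `g_db` are hypothetical.

```python
def extract_pulse(u, n, p_u, g_db):
    """Return one gain-normalized pitch cycle of residue u around index n.

    Assumptions: g_db is the gain in decibels, and the caller ensures that
    the p_u samples starting at n - (p_u - 1) // 2 lie inside u.
    """
    start = n - (p_u - 1) // 2
    scale = 10.0 ** (g_db / 20.0)
    return [u[start + j] / scale for j in range(p_u)]


# Usage with a hypothetical ramp-shaped residue
ramp = [float(i) for i in range(100)]
pulse = extract_pulse(ramp, n=50, p_u=5, g_db=20.0)
```

Dividing by the gain term removes the energy of the pulse, so that the later code book search compares shapes rather than levels.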

A complex spectrum signal, V_uq, which is a complex vector of dimension, p_u, is then obtained by taking a p_u-point Discrete Fourier Transform (DFT) of v_uq. A harmonics vector, H_uq, of dimension, d_h, is then obtained from V_uq. More particularly:

. , d_h
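The DFT step can be sketched in Python as shown below. The p_u-point DFT itself follows the text. The way the harmonics vector is formed here (magnitudes of DFT bins 1 to d_h, with d_h taken as floor(p_u/2) and no normalization) is only an assumption made for illustration and is not confirmed by the text above. The function names are hypothetical.

```python
import cmath


def dft(v):
    """Naive p-point DFT, O(p^2); adequate for a single pitch cycle."""
    p = len(v)
    return [
        sum(v[m] * cmath.exp(-2j * cmath.pi * k * m / p) for m in range(p))
        for k in range(p)
    ]


def harmonics(v_uq):
    """Magnitudes of bins 1..d_h of the p_u-point DFT of v_uq.

    Assumption: d_h = floor(p_u / 2) and the DC bin is excluded.
    """
    V_uq = dft(v_uq)
    d_h = len(v_uq) // 2
    return [abs(V_uq[k]) for k in range(1, d_h + 1)]


# Usage: one cycle of a cosine with period 8 has all its energy in bin 1
cycle = [cmath.cos(2 * cmath.pi * m / 8).real for m in range(8)]
H_example = harmonics(cycle)
```

Because the analysis length equals the pitch period, each DFT bin corresponds to one harmonic of the pitch frequency, which is why the number of harmonics varies with pitch.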

The spectral magnitude vector, S_uq, (of dimension, d_sm=64, in the preferred embodiment of the invention), is obtained from the harmonics vector, H_uq, by interpolation or decimation. Three cases must be considered, namely those in which d_h=d_sm, those in which d_h<d_sm, and those in which d_h>d_sm. If d_h=d_sm, then S_uq is set equal to H_uq. Note that the number of harmonics, d_h, is related to pitch, is time variant, and varies with individual speakers, whereas d_sm is fixed.

Figure 5 illustrates the interpolation process for the case d_h<d_sm, for representative values of d_h=9 and d_sm=14. The two end elements ss₁, ss₉ of a source sequence of d_h elements (upper portion of Figure 5) are initially repositioned (central portion of Figure 5) to coincide with the end elements ts₁, ts₁₄ of the desired target sequence (lower portion of Figure 5). The source sequence elements are equispaced, as are the target sequence elements, although the spacings are of arbitrary size in each sequence. Then, the source sequence elements between the end points are copied to the nearest element positions in the target sequence. Thus, source sequence elements ss₁, ss₂, ss₃, and ss₄ depicted in the central portion of Figure 5 are copied to produce target sequence elements ts₁, ts₃, ts₅, and ts₆ respectively, as depicted in the lower portion of Figure 5. Since d_h<d_sm (i.e. 9 < 14), some empty positions, such as ts₂ and ts₄, remain in the target sequence. These empty positions are filled by inserting values obtained by interpolation between the closest adjacent target sequence values copied from the source sequence. Thus, the value inserted in empty position ts₂ is obtained by interpolation between the previously copied target sequence elements ts₁, ts₃; and, the value inserted in empty position ts₄ is obtained by interpolation between the previously copied target sequence elements ts₃, ts₅, etc.

Figure 6 illustrates the decimation process for the case d_h>d_sm, for representative values of d_h=25 and d_sm=8. The two end elements ss₁, ss₂₅ of the source sequence of d_h elements (upper portion of Figure 6) are initially repositioned (central portion of Figure 6) to coincide with the end elements ts₁, ts₈ of the desired target sequence (lower portion of Figure 6). Then, the source sequence elements between the end points are copied to the nearest element positions in the target sequence. Since d_h>d_sm (i.e. 25 > 8), some target sequence positions (in the case illustrated, all target sequence positions) must receive copies of more than one of the source sequence elements. Thus, source sequence elements ss₁, ss₂, ss₃ and ss₄ depicted in the central portion of Figure 6 are all copied to produce target sequence element ts₁; source sequence elements ss₅, ss₆ and ss₇ are all copied to produce target sequence element ts₂, etc., as depicted in the lower portion of Figure 6. If more than one source sequence element is copied to produce a single target sequence element as aforesaid, the value of the resultant single target sequence element is determined as a weighted average of the source sequence elements in question. For example, source sequence elements ss₁, ss₂, ss₃ and ss₄ are weighted to produce target sequence element ts₁ as:

$$ts_1 = w_1\,ss_1 + w_2\,ss_2 + w_3\,ss_3 + w_4\,ss_4$$

where w₁, w₂, w₃, w₄ are weighting values which can be obtained in any one of a number of ways well known to persons skilled in the art.

The interpolation/decimation operation of the preferred embodiment of the invention is expressed in pseudo-code as follows:

If d_h < d_sm, then

end for

S_uq(j_k) = weighted average {H_uq(k), ..., H_uq(k+i-1)}

end for
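The interpolation and decimation procedure described in relation to Figures 5 and 6 can be sketched in Python as follows. This is an illustrative reading of the prose, not the pseudo-code of the preferred embodiment. It assumes a uniform linear mapping of source positions onto target positions with the end elements coinciding, rounding to the nearest target position, linear interpolation for empty positions, and equal weights in the weighted average. Under these assumptions the element placements may differ from those drawn in the figures. The function name `resample_harmonics` is hypothetical.

```python
def resample_harmonics(H_uq, d_sm):
    """Map a harmonics vector of any length onto d_sm elements."""
    d_h = len(H_uq)
    if d_h == d_sm:
        return list(H_uq)
    if d_h == 1:
        return [H_uq[0]] * d_sm

    # Copy each source element to its nearest target position; the first
    # and last source elements land on the first and last target positions.
    buckets = [[] for _ in range(d_sm)]
    for k, h in enumerate(H_uq):
        pos = k * (d_sm - 1) / (d_h - 1)
        buckets[int(pos + 0.5)].append(h)

    # Decimation: several source elements in one position are averaged
    # (equal weights assumed). Interpolation: empty positions stay None.
    S_uq = [sum(b) / len(b) if b else None for b in buckets]

    # Fill empty positions by linear interpolation between the closest
    # filled neighbours on either side.
    filled = [i for i, s in enumerate(S_uq) if s is not None]
    for left, right in zip(filled, filled[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            S_uq[i] = (1 - t) * S_uq[left] + t * S_uq[right]
    return S_uq


# Usage with hypothetical values
up = resample_harmonics([1.0, 3.0], 3)              # interpolation case
down = resample_harmonics([1.0, 2.0, 3.0, 4.0], 2)  # decimation case
```

A single function handles all three cases because the only difference between them is whether a target position receives zero, one, or several source elements.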

Spectral Magnitude Code book Training

The vector quantizer code book, C_sm, (Figure 4, blocks 46, 48) is obtained by generating a very large training set of spectral magnitude vectors, S_uq, obtained from a database of different speakers and sentences. After the training set vectors are obtained, the code book, C_sm, is obtained by means of the LBG algorithm (see Y. Linde, A. Buzo and R.M. Gray, "An algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. COM-28, pp. 84-95, January 1980). Once the code book, C_sm, has been obtained, any spectral magnitude vector can then be encoded by selecting a suitable vector from the code book.
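For orientation, a simplified Python sketch in the style of the LBG algorithm is given below. It is not the procedure of the cited paper in full: it assumes an unweighted squared-error distortion, a code book size that is a power of two, an additive split perturbation `eps`, and a fixed number of refinement iterations in place of a distortion-based stopping rule. All function and parameter names are hypothetical.

```python
def nearest(codebook, x):
    """Index of the code word closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train_codebook(training, size, eps=0.01, iters=20):
    """Grow a code book by repeated splitting and centroid refinement."""
    codebook = [centroid(training)]
    while len(codebook) < size:
        # Split every code word into a slightly perturbed pair.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(iters):
            cells = [[] for _ in codebook]
            for x in training:
                cells[nearest(codebook, x)].append(x)
            # Empty cells keep their previous code word.
            codebook = [
                centroid(cell) if cell else cw
                for cell, cw in zip(cells, codebook)
            ]
    return codebook


# Usage with a tiny hypothetical training set of two clusters
toy_training = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_codebook = train_codebook(toy_training, size=2)
```

In the embodiment described here the training vectors would be 64-element spectral magnitude vectors and the resulting table would hold 256 code words; the toy data above only demonstrates the mechanics.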

Encoding Spectral Magnitude Vectors

The code book, C_sm, consists of M vectors of dimension, d_sm = 64. In the preferred embodiment, M = 256. Encoding the vector, S_uq, (Figure 4, block 40) involves selecting a vector entry from the code book, C_sm, that minimizes a specified error criterion. The spectral magnitude index, i_sm, denotes the vector entry selected from the spectral magnitude code book, C_sm.

A weighted mean square error criterion is used for the code book search. The weighting function, w_sm, used in the search procedure, is defined as follows:

^W _Sm -J' S_uq (j) <0 . 25 J = 1 , . .

j≤d_sm/2\ otherwise

Given the weighting function, w_sm, as indicated above, the code book search procedure is as follows:

e_min = 10^38

for i = 1 to M

    e_i = sum over j = 1, ..., d_sm of (S_uq(j) - C_sm(i,j))² w_sm(j)

    if e_i < e_min then e_min = e_i and i_sm = i

end for
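The weighted mean square error search can be sketched in Python as below. The weighting function is passed in as a plain list, since its exact values are a design choice; the function name `search_codebook`, the zero-based index it returns and the use of infinity as the initial minimum are assumptions of this sketch.

```python
def search_codebook(S_uq, C_sm, w_sm):
    """Return (i_sm, e_min): index and error of the best-matching code word.

    The error is sum_j w_sm[j] * (S_uq[j] - C_sm[i][j])**2. The returned
    index is zero-based, unlike the one-based indexing of the text.
    """
    e_min = float("inf")
    i_sm = -1
    for i, code_word in enumerate(C_sm):
        e_i = sum(w * (s - c) ** 2 for s, c, w in zip(S_uq, code_word, w_sm))
        if e_i < e_min:
            e_min, i_sm = e_i, i
    return i_sm, e_min


# Usage with a hypothetical three-entry code book
toy_C = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
best_index, best_error = search_codebook([1.1, 0.9], toy_C, [1.0, 1.0])
```

Only the index needs to be transmitted: with M = 256 entries, it fits in 8 bits regardless of the pitch period.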

Decoding Spectral Magnitude Vectors

Given the index, i_sm, the quantized spectral magnitude vector, S, is obtained (Figure 4, block 42) by copying the (i_sm)th vector from the code book, C_sm, as follows:

$$S(j) = C_{sm}(i_{sm}, j) \quad \text{for } j = 1, \ldots, d_{sm}$$

Computation of Pulse-shape Vectors

Pulse-shape vectors are computed (Figure 4, block 44) using the quantized pitch, p, not the unquantized pitch p_u. More particularly, the d_sm elements of the vector S are used in obtaining the complex spectrum signal, X = {X(j), j = 0, ..., 2d_sm-1}, as follows:

$$\mathrm{Re}(X(0)) = 0.0$$

$$\mathrm{Re}(X(j)) = \mathrm{Re}(X(2d_{sm} - j)) = (p/2)\,S(j) \quad \text{for } j = 1, \ldots, d_{sm}$$

$$\mathrm{Re}(X(d_{sm})) = 2 \cdot \mathrm{Re}(X(d_{sm}))$$

Having obtained the complex spectrum signal, X, which is a complex vector of dimension 2d_sm, the complex time-domain pulse signal, x, is obtained by taking a 2d_sm-point Inverse Fast Fourier Transform (IFFT) of X.

The pulse shape vector, v, is then obtained (Figure 4, block 48) as follows:

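$$v(j) = \mathrm{Re}(x(j)) \quad \text{for } j = 0, \ldots, \lfloor p/2 \rfloor - 1$$

$$v(p - j) = \mathrm{Re}(x(2d_{sm} - j)) \quad \text{for } j = 1, \ldots, p - \lfloor p/2 \rfloor$$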
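The decoder-side computation of the pulse shape vector can be sketched in Python as follows. The sketch assumes zero imaginary parts for every element of X (as recited in the claims), uses a naive inverse DFT in place of an IFFT, takes floor(p/2) as the split point between the two halves of the pulse, and requires p to be no larger than 2d_sm. The function names are hypothetical, and the list `S` is zero-based, so that `S[j - 1]` holds S(j).

```python
import cmath


def idft(X):
    """Naive N-point inverse DFT, O(N^2); stands in for the IFFT."""
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * m / N) for k in range(N)) / N
        for m in range(N)
    ]


def pulse_shape(S, p):
    """Build a p-sample pulse from a quantized spectral magnitude vector S."""
    d_sm = len(S)
    N = 2 * d_sm
    X = [0j] * N                       # Re(X(0)) = 0, all imaginary parts 0
    for j in range(1, d_sm + 1):
        value = complex((p / 2) * S[j - 1], 0.0)
        X[j] = value                   # Re(X(j)) = (p/2) S(j)
        X[N - j] = value               # mirrored element Re(X(2d_sm - j))
    X[d_sm] = 2 * X[d_sm]              # middle element is doubled
    x = idft(X)

    half = p // 2                      # assumption: floor(p/2)
    v = [0.0] * p
    for j in range(half):
        v[j] = x[j].real               # first part from the start of x
    for j in range(1, p - half + 1):
        v[p - j] = x[N - j].real       # second part from the end of x
    return v


# Usage: a single unit harmonic with d_sm = 4 and p = 8 gives one cosine cycle
v_example = pulse_shape([1.0, 0.0, 0.0, 0.0], 8)
```

Because the spectrum is real and symmetric, the time-domain signal is a zero-phase pulse concentrated at index 0 and wrapped around to the end of x, which is why the second half of v is read from the tail of x.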

As will be apparent to those skilled in the art in the light of the foregoing disclosure, many alterations and modifications are possible in the practice of this invention without departing from the spirit or scope thereof. For example, as noted above, the weighting values used in interpolation of the spectral magnitude vector (Figure 6) can be obtained in any one of a number of ways well known to persons skilled in the art. The same is true of the weighting function, w_sm, used in searching the code book, as described above in the section headed "Encoding Spectral Magnitude Vectors". As a further example, different mapping techniques can be used in the interpolation/decimation processes described above in relation to Figures 5 and 6. Thus, instead of mapping the first element of the source sequence to the first element of the target sequence and the last element of the source sequence to the last element of the target sequence (which may not be very accurate, and may not yield good results for larger values of d_sm) one could alternatively compute the frequencies corresponding to the first and the last element of the source sequence and map those source sequence elements to the target sequence elements having the nearest corresponding frequencies. This of course means that choice of an appropriate value for d_sm is another source of variation on the algorithm. Accordingly, the scope of the invention is to be construed in accordance with the substance defined by the following claims.

Claims

WHAT IS CLAIMED IS:

1. A method of determining a pulse shape vector v for a linear predictive speech coder from a voiced residue pulse v_uq, during a sampling instant n characterized by a gain g and a pitch period p, said method characterized by:

(a) deriving a d_sm dimension spectral magnitude vector S_uq representative of the frequency spectral magnitude of said pulse during said sampling instant;

(b) providing a code book C_sm containing a plurality of vectors representative of pre-selected spectral magnitude vectors;

(c) selecting, from said code book, one of said plurality of vectors which provides a minimum error approximation to said spectral magnitude vector S_uq, said selected minimum error approximation vector having a spectral magnitude index i_sm within said code book;

(d) deriving a quantized spectral magnitude vector S having said spectral magnitude index i_sm and having d_sm elements;

(e) deriving a complex frequency spectrum signal X having real and imaginary components for each of said elements;

(f) converting said complex frequency spectrum signal X to a complex time-domain representation x; and,

(g) deriving said pulse shape vector v from the Real components of said complex time-domain representation x.

2. A method as defined in Claim 1, wherein said derivation of said spectral magnitude vector S_uq further comprises:

(a) deriving an unquantized time domain pulse shape vector v_uq, where:

$$v_{uq}(j) = \frac{u\left(n - \lfloor (p_u - 1)/2 \rfloor + j\right)}{10^{g/20}}$$

for j = 0, ..., p_u-1;

(b) deriving a complex spectrum signal V_uq by taking a p_u-point Discrete Fourier Transform of said unquantized time domain pulse shape vector v_uq;

(c) deriving a harmonics vector H_uq, where:

. , d_h

(d) interpolating said harmonics vector H_uq to form said spectral magnitude vector S_uq.

3. A method as defined in Claim 2, wherein said harmonics vector H_uq has a dimension d_h, and wherein said interpolating further comprises:

(a) if d_h=d_sm, setting S_uq equal to H_uq;

(b) if d_h<d_sm:

(i) copying a first element of S_uq to a corresponding first element position of H_uq;

(ii) copying a last element of S_uq to a corresponding last element position of H_uq;

(iii) for each intermediate element of S_uq between said first element of S_uq and said last element of S_uq, copying said intermediate element to a closest corresponding intermediate element position of H_uq;

(iv) for any one of said intermediate element positions of H_uq to which no intermediate element of S_uq is copied, copying to said one intermediate element position of H_uq a value derived by interpolation between a first element of H_uq immediately adjacent and on a first side of said one intermediate element position of H_uq and a second element of H_uq immediately adjacent and on a second side of said one intermediate element position of H_uq;

(c) if d_h>d_sm:

(i) copying a first element of S_uq to a corresponding first element position of H_uq;

(ii) copying a last element of S_uq to a corresponding last element position of H_uq;

(iii) for each intermediate element of S_uq between said first element of S_uq and said last element of S_uq, copying said intermediate element to a closest corresponding intermediate element position of H_uq; and,

(iv) for any one of said intermediate element positions of H_uq to which more than one intermediate element of S_uq is to be copied, copying to said one intermediate element position of H_uq a weighted average of all of said more than one intermediate elements of S_uq.

4. A method as defined in Claim 1, wherein said selection of said minimum error approximation vector further comprises:

(a) deriving a weighting function w_sm, where

w„ , (J ) 1 , d_s

(b) deriving an error value e_i for each vector in said code book C_sm, where:

$$e_i = \sum_{j=1}^{d_{sm}} \left(S_{uq}(j) - C_{sm}(i,j)\right)^2 w_{sm}(j)$$

and i is said index of said vector within said code book C_sm; and,

(c) selecting that one of said plurality of vectors with index i_sm within said code book C_sm for which e_{i_sm} < e_i for all i ≠ i_sm.

5. A method as defined in Claim 1, wherein said derivation of said quantized spectral magnitude vector S further comprises deriving S(j) = C_sm(i_sm, j) for j = 1, ..., d_sm.

6. A method as defined in Claim 1, wherein said derivation of said complex spectrum signal X further comprises:

(a) setting Re(X(0)) = 0.0;

(b) setting Re(X(j)) = Re(X(2d_sm-j)) = (p/2)S(j) for j = 1, ..., d_sm;

(c) setting Re(X(d_sm)) = 2·Re(X(d_sm)); and,

(d) setting Im(X(j)) = 0.0 for j = 1, ..., 2d_sm-1.

7. A method as defined in Claim 6, wherein said conversion of said complex frequency spectrum signal X to said complex time-domain representation x further comprises deriving an inverse Fourier transform of X.

8. A method as defined in Claim 7, wherein said derivation of said pulse shape vector v from said Real components of said complex time-domain representation x further comprises:

(a) setting v(j) = 0.0 for j = 0, ..., p-1;

(b) setting v(j) = Re(x(j)) for j = 0, ..., ⌊p/2⌋-1; and,

(c) setting v(p-j) = Re(x(2d_sm-j)) for j = 1, ..., p-⌊p/2⌋.