COMPUTATION AND QUANTIZATION OF VOICED
EXCITATION PULSE SHAPES IN LINEAR
PREDICTIVE CODING OF SPEECH
Technical Field
This invention is directed to linear predictive coding of voiced speech sounds. A single code book containing code words representative of different frequency spectra facilitates reconstruction of speech sounds, irrespective of pitch differences in such sounds.
Background
Linear Predictive Coding (LPC) of speech involves estimating the coefficients of a time varying filter (henceforth called a "synthesis filter") and providing appropriate excitation (input) to that time varying filter. The process is conventionally broken down into two steps known as encoding and decoding.
As shown in Figure 1, in the encoding step, the speech signal s is first filtered by pre-filter 10. The pre-filtered speech signal sp is then analyzed by LPC Analysis block 14 to compute the coefficients of the synthesis filter. Then, an "analysis filter" 12 is formed, using the same coefficients as the synthesis filter but having an inverse structure. The pre-filtered speech signal sp is processed by analysis filter 12 to produce an output signal u called the "residue". Information about the filter coefficients and the residue is passed to the decoder for use in the decoding step.
In the decoding step, a synthesis filter 18 is formed using the coefficients obtained from the encoder. An appropriate excitation signal e is applied to synthesis filter 18 by excitation generator 16, based on the information about the residue obtained from the encoder.
Synthesis filter 18 outputs a synthetic speech signal y, which is ideally the closest possible approximation to the original speech signal s.
The present invention pertains to excitation generator 16 and to the way in which information about the residue passes from the encoder to the decoder. Analysis filter 12 and synthesis filter 18 are exact inverses of each other. Therefore, if the residue signal u were applied directly to synthesis filter 18, the decoder would exactly reproduce the pre-filtered speech signal sp. In other words, if the precise residue signal u could be transferred from the encoder to the decoder, then the synthetic speech output signal y would be of very high quality (i.e. as good as the pre-filtered speech signal sp). However, bandwidth restrictions necessitate quantization of the residue signal u, which unavoidably distorts the excitation signal e and the resultant synthetic speech signal y. Excitation generator 16 incorporates both a "voiced" excitation generator and an "unvoiced" excitation generator. The quantization process exploits structural differences between voiced and unvoiced components of the residue. The voiced residue is quasi-periodic, while the unvoiced residue is like a randomly varying signal. The present invention deals particularly with quantization of the voiced residue, and corresponding generation of voiced excitation in the decoder.
The voiced residue can be described in terms of three parameters for quantization purposes: pitch, pu; gain, g; and the shape of a single cycle, called the pulse shape. Pitch refers to the periodicity of the signal and is equal to the distance between subsequent pulses in the residue signal u. Gain refers to the energy of the signal and is higher for a residue having higher energy. The pulse shape is the actual geometric shape of each pulse (a single cycle) in the voiced residue. A typical voiced residue signal is shown in Figure 2.
Prior art LPC coding techniques have quantized pitch and gain parameters, but have achieved only poor representation of pulse shapes. For example, early LPC coders used single unit impulses to represent pulse shape (Markel, J.D. and Gray, A.H. Jr., "A Linear Prediction Vocoder Simulation Based Upon the Autocorrelation Method", IEEE Trans. ASSP, Vol. 22, 1974, pp. 124-134); the LPC-10 government standard (U.S. Government Federal Standard 1015, 1977) represented each pulse by a fixed shape; and more recently, excitation pulse shapes have been represented as a sum of a fixed shape and random noise (McCree, A.V. and Barnwell III, T.P., "A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding," IEEE Trans. on Speech and Audio Processing, Vol. 3, No. 4, July 1995, pp. 242-250). Pulse trains constructed from such restricted shapes provide a poor representation of the variations in pulse shapes observed in residual signals output by analysis filter 12, as is evident from the sample residue signal shown in Figure 2.
A common technique known in the art of speech coding is "vector quantization", in which a vector of samples (e.g. a signal segment) is represented as one of a predetermined set of vectors called "code words". All of the code words are assembled to form a table called a "code book". The difficulty in using a standard vector quantization approach is that the pulse shapes required to be represented in LPC based speech coding are not of fixed length, but vary with pitch period. In principle, one could construct a plurality of code books, one for each possible value of pitch period, but this approach requires too many code books. It is impractical in many cases to use multiple code books due to memory limitations of the hardware in which speech encoding and decoding capabilities are preferably provided. For example, large integrated circuit memory chips have relatively high power consumption requirements which cannot be satisfied in small battery powered systems such as voice pagers, which must remain active for months between battery replacements.
This invention provides improved representation of pulse shapes in LPC coding of voiced speech, irrespective of pitch period variations, and requires only a single code book. The dashed line shown in Figure 1 represents the transfer of information about the residue from analysis filter 12 to excitation generator 16. Figure 3 depicts this transfer in greater detail in respect of the aforementioned pitch, gain and pulse shape parameters. However, the present invention focuses only on transfer of an improved pulse shape parameter in LPC coding of voiced speech sounds.
Summary of Invention
The invention facilitates good quality LPC coding of voiced speech sounds through better quantization of excitation pulse shapes for all possible pitch periods. Unlike prior art techniques which use fixed shape excitation pulses, or excitation pulses formed by adding random noise to a fixed shape, the invention utilizes a novel frequency domain code book with code words representative of signal frequency spectra, to select a pulse shape that closely matches the original pulse shape from the residue signal.
In particular, the invention provides a method of determining a pulse shape vector v for a linear predictive speech coder from a voiced residue pulse vuq, during a sampling instant n characterized by a gain g and a pitch period pu. A spectral magnitude vector Suq of dimension dsm is derived to represent the frequency spectral magnitude of the pulse during the sampling instant. A code book Csm, containing a plurality of vectors representative of pre-selected spectral magnitude vectors, is provided. A vector which provides a minimum error approximation to Suq is selected from the code book. The spectral magnitude index, ism, is the index within the code book of the selected minimum error approximation vector. A quantized spectral magnitude vector S having the spectral magnitude index ism and having dsm elements is then derived. A complex frequency spectrum signal X is derived from the quantized spectral magnitude vector S and the quantized pitch period p. This in turn is converted to a complex time domain representation x. The pulse shape vector v is then derived from the real components of x.
Brief Description of Drawings
Figure 1 is a block diagram representation of a prior art LPC based speech encoder/decoder.
Figure 2 depicts a typical voiced residue signal waveform and the shapes of individual pulses found in typical voiced residue/excitation signals.
Figure 3 is a block diagram representation of the information pathway over which information respecting the voiced residue is transferred from the encoder to the decoder in the preferred embodiment of the invention.
Figure 4 is a block diagram representation showing further details of the pulse shape encoder and pulse shape decoder blocks depicted in Figure 3.
Figure 5 graphically depicts interpolation of a harmonics vector, in accordance with the invention, to produce a spectral magnitude vector for cases in which the dimension of the harmonics vector is less than the desired dimension of the spectral magnitude vector.
Figure 6 graphically depicts decimation of a harmonics vector, in accordance with the invention, to produce a spectral magnitude vector for cases in which the dimension of the harmonics vector exceeds the desired dimension of the spectral magnitude vector.
Description
Introduction
As previously explained, the pre-filtered signal, sp, (Figure 1) is obtained by passing the original speech signal, s, through a pre-processing filter 10. The residue, u, is obtained by passing the pre-filtered signal, sp, through a time-varying all-zero LPC analysis filter 12. The coefficients applied to filter 12 are obtained by LPC analyzer 14 using techniques which are well known to persons skilled in the art and need not be described here.
If, at any desired sampling (time) instant, n, the original speech signal s is classified as voiced (using techniques which are well known in the art), then a pulse-shape vector vuq is obtained as described below for that particular sampling instant. The energy at any sampling instant, n, is represented by a gain, g, corresponding to the root mean square value of the residue over a window (typically having a length of 80-160 samples) centred at the sampling instant, n. The pitch period at any sampling instant, n, as determined in the speech encoder, is denoted by pu and the quantized pitch at the speech decoder is denoted by p.
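The gain computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `rms_gain_db`, the truncation of the window at the signal edges, and the small floor that avoids taking the logarithm of zero are all assumptions. Expressing the gain in decibels is also an assumption, inferred from the 10^(g/20) normalisation applied to the pulse shape vector later in the description.

```python
import math

def rms_gain_db(u, n, window=160):
    """Hypothetical helper: RMS of residue u over a window centred at n, in dB."""
    half = window // 2
    lo = max(0, n - half)            # assumption: clip the window at the signal start
    hi = min(len(u), n + half)       # assumption: clip the window at the signal end
    segment = u[lo:hi]
    rms = math.sqrt(sum(x * x for x in segment) / len(segment))
    return 20.0 * math.log10(max(rms, 1e-12))  # floor avoids log10(0)
```

For a constant residue of amplitude 3, for example, the function returns 20·log10(3), roughly 9.54 dB.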
More particularly, as seen in Figure 3, voicing and gain analyzer 20 receives original speech signal s and residue u, and outputs signals representative of pitch period pu, gain g and pulse-shape vector vuq respectively. On the encoder side, pitch encoder 24 processes pitch period pu for further processing by pitch decoder 34 on the decoder side to yield quantized pitch p, which is in turn input to the decoder's voiced excitation generator 22. Pulse shape encoder 28 processes pulse-shape vector vuq for further processing by pulse shape decoder 30 to yield pulse shape vector v for input to voiced excitation generator 22. Gain encoder 26 processes the gain characteristic of the signal output by voicing and gain analyzer 20 for further processing by gain decoders 32, 36 which respectively yield the gain g for input to voiced excitation generator 22 (on the decoder side) and pulse shape encoder 28 (on the encoder side). The operation of pulse shape encoder 28 and pulse shape decoder 30 will now be described in further detail, with reference to Figure 4.
Computation of Spectral Magnitude Vectors
A spectral magnitude vector, Suq, is obtained (Figure 4, block 38) as follows. First, an unquantized time domain pulse shape vector, vuq, is determined as:
$$v_{uq}(j) = \frac{u\left(n - \lfloor (p_u - 1)/2 \rfloor + j\right)}{10^{g/20}}, \qquad j = 0, \ldots, p_u - 1$$
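The extraction of the unquantized pulse shape vector can be sketched in a few lines, reading the formula as vuq(j) = u(n − ⌊(pu − 1)/2⌋ + j) / 10^(g/20) for j = 0, ..., pu − 1. The function name `extract_pulse` is hypothetical, and the sketch assumes that the gain g is given in decibels and that the whole pitch cycle lies inside the residue buffer.

```python
def extract_pulse(u, n, pu, g_db):
    """Hypothetical helper: one gain-normalised pitch cycle of residue u centred at n."""
    start = n - (pu - 1) // 2          # floor((pu - 1) / 2) samples before n
    scale = 10.0 ** (g_db / 20.0)      # assumption: g is expressed in dB
    # assumption: 0 <= start and start + pu <= len(u)
    return [u[start + j] / scale for j in range(pu)]
```

Dividing by the gain means that only the shape of the cycle, not its energy, is passed on to the spectral magnitude computation.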
A complex spectrum signal, Vuq, which is a complex vector of dimension, pu, is then obtained by taking a pu-point Discrete Fourier Transform (DFT) of vuq. A harmonics vector, Huq, of dimension, dh, is then obtained from Vuq. More particularly:
The spectral magnitude vector, Suq, (of dimension, dsm=64, in the preferred embodiment of the invention), is obtained from the harmonics vector, Huq, by interpolation or decimation. Three cases must be considered, namely those in which dh=dsm, those in which dh<dsm, and those in which dh>dsm. If dh=dsm, then Suq is set equal to Huq. Note that the number of harmonics, dh, is related to pitch, is time variant, and varies with individual speakers, whereas dsm is fixed.
Figure 5 illustrates the interpolation process for the case dh<dsm, for representative values of dh=9 and dsm=14. The two end elements ss1, ss9 of a source sequence of dh elements (upper portion of Figure 5) are initially repositioned (central portion of Figure 5) to coincide with the end elements ts1, ts14, of the desired target sequence (lower portion of Figure 5). The source sequence elements are equi-spaced, as are the target sequence elements, although the spacings are of arbitrary size in each sequence. Then, the source sequence elements between the end points are copied to the nearest element positions in the target sequence. Thus, source sequence elements ss1, ss2, ss3, and ss4 depicted in the central portion of Figure 5 are copied to produce target sequence elements ts1, ts3, ts5, and ts6 respectively, as depicted in the lower portion of Figure 5. Since dh<dsm (i.e. 9 < 14), some empty positions, such as ts2 and ts4, remain in the target sequence. These empty positions are filled by inserting values obtained by interpolation between the closest adjacent target sequence values copied from the source sequence. Thus, the value inserted in empty position ts2 is obtained by interpolation between the previously copied target sequence elements ts1, ts3; and, the value inserted in empty position ts4 is obtained by interpolation between the previously copied target sequence elements ts3, ts5, etc.
Figure 6 illustrates the decimation process for the case dh>dsm, for representative values of dh=25 and dsm=8. The two end elements ss1, ss25 of the source sequence of dh elements (upper portion of Figure 6) are initially repositioned (central portion of Figure 6) to coincide with the end elements ts1, ts8 of the desired target sequence (lower portion of Figure 6). Then, the source sequence elements between the end points are copied to the nearest element positions in the target sequence. Since dh>dsm (i.e. 25 > 8), some target sequence positions (in the case illustrated, all target sequence positions) must receive copies of more than one of the source sequence elements. Thus, source sequence elements ss1, ss2, ss3 and ss4 depicted in the central portion of Figure 6 are all copied to produce target sequence element ts1; source sequence elements ss5, ss6 and ss7 are all copied to produce target sequence element ts2, etc. as depicted in the lower portion of Figure 6. If more than one source sequence element is copied to produce a single target sequence element as aforesaid, the value of the resultant single target sequence element is determined as a weighted average of the source sequence elements in question. For example, source sequence elements ss1, ss2, ss3 and ss4 are weighted to produce target sequence element ts1 as: ts1 = w1·ss1 + w2·ss2 + w3·ss3 + w4·ss4, where w1, w2, w3, w4 are weighting values which can be obtained in any one of a number of ways well known to persons skilled in the art. The interpolation/decimation operation of the preferred embodiment of the invention is expressed in pseudo-code as follows:
If dh < dsm, then
    ...
    Suq(j) = weighted average {Huq(k), ..., Huq(k+i-1)}
end for
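The interpolation and decimation steps described above can be illustrated with a short sketch. It is an approximation under stated assumptions rather than the preferred embodiment: the function name `resample_harmonics` is hypothetical, the "nearest position" is found by rounding a linear mapping between the end points, the weighted average uses equal weights, and empty positions are filled by linear interpolation. With this rounding rule the individual assignments may differ from those drawn in Figures 5 and 6.

```python
def resample_harmonics(h, dsm):
    """Hypothetical helper: map a harmonics vector h (length dh) onto dsm positions."""
    dh = len(h)
    if dh == dsm:
        return list(h)                      # case dh = dsm: copy unchanged
    if dh == 1:
        return [h[0]] * dsm                 # degenerate case, assumption
    # Copy every source element to its nearest target position
    # (end points map to end points).
    buckets = [[] for _ in range(dsm)]
    for i, value in enumerate(h):
        pos = int(i * (dsm - 1) / (dh - 1) + 0.5)
        buckets[pos].append(value)
    # Decimation: several sources in one position are averaged (equal weights assumed).
    target = [sum(b) / len(b) if b else None for b in buckets]
    # Interpolation: fill empty positions linearly between the closest filled neighbours.
    filled = [k for k, t in enumerate(target) if t is not None]
    for a, b in zip(filled, filled[1:]):
        for k in range(a + 1, b):
            frac = (k - a) / (b - a)
            target[k] = target[a] + frac * (target[b] - target[a])
    return target
```

Because the first and last source elements always land on the first and last target positions, every empty position has a filled neighbour on both sides, so the interpolation loop leaves no gaps.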
Spectral Magnitude Code book Training
The vector quantizer code book, Csm, (Figure 4, blocks 46, 48) is obtained by generating a very large training set of spectral magnitude vectors, Suq, obtained from a database of different speakers and sentences. After the training set vectors are obtained, the code book, Csm, is obtained by means of the LBG algorithm (see Y. Linde, A. Buzo and R.M. Gray, "An algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. COM-28, pp. 84-95, January 1980). Once the code book, Csm, has been obtained, any spectral magnitude vector can then be encoded by selecting a suitable vector from the code book.
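For orientation, a simplified code book training loop in the spirit of the LBG algorithm is sketched below. It is not the procedure of the cited paper in full: the names `nearest` and `train_codebook` are hypothetical, the code book is grown by splitting each code word with a small additive perturbation, a fixed number of refinement passes replaces the distortion-threshold stopping rule, an unweighted squared error is used, and the requested size is assumed to be a power of two (as with M = 256).

```python
def nearest(codebook, vec):
    """Index of the code word with the smallest squared error to vec."""
    best_i, best_d = 0, float("inf")
    for i, cw in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(vec, cw))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def train_codebook(training, size, eps=0.01, passes=20):
    """Hypothetical helper: grow a code book by splitting, refine by Lloyd iterations."""
    dim = len(training[0])
    # Start from the centroid of the whole training set.
    codebook = [[sum(v[k] for v in training) / len(training) for k in range(dim)]]
    while len(codebook) < size:            # assumption: size is a power of two
        # Split every code word into two slightly perturbed copies.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(passes):
            cells = [[] for _ in codebook]
            for v in training:
                cells[nearest(codebook, v)].append(v)
            for i, cell in enumerate(cells):
                if cell:                   # empty cells keep their old code word
                    codebook[i] = [sum(v[k] for v in cell) / len(cell)
                                   for k in range(dim)]
    return codebook
```

On a toy training set with two well separated clusters, a two-entry code book converges to the two cluster centroids.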
Encoding Spectral Magnitude Vectors
The code book, Csm, consists of M vectors of dimension, dsm = 64. In the preferred embodiment, M = 256. Encoding the vector, Suq, (Figure 4, block 40) involves selecting a vector entry from the code book, Csm, that minimizes a specified error criterion. The spectral magnitude index, ism, denotes the vector entry selected from the spectral magnitude code book, Csm.
A weighted mean square error criterion is used for the code book search. The weighting function, wsm, used in the search procedure, is defined as follows:
$$w_{sm}(j) = \begin{cases} \ldots, & S_{uq}(j) < 0.25 \\ \ldots, & j \le d_{sm}/2 \\ \ldots, & \text{otherwise} \end{cases} \qquad j = 1, \ldots, d_{sm}$$
Given the weighting function, wsm, as indicated above, the code book search procedure is as follows:
emin = 10^38
for i = 1 to M
    e = sum over j = 1, ..., dsm of (Suq(j) - Csm^(i)(j))^2 wsm(j)
    if e < emin, then emin = e and ism = i
end for
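A weighted mean square error search of the kind described in this section can be sketched as follows. The function name `search_codebook` is hypothetical, the weights are passed in as a plain list rather than computed from the weighting function, and indices are zero-based, whereas the description counts code book entries from 1.

```python
def search_codebook(s_uq, codebook, weights):
    """Hypothetical helper: return (index, error) of the minimum weighted-error entry."""
    best_index, best_error = -1, float("inf")
    for i, code_word in enumerate(codebook):
        # Weighted squared error between the input vector and this entry.
        err = sum(w * (s - c) ** 2 for s, c, w in zip(s_uq, code_word, weights))
        if err < best_error:
            best_index, best_error = i, err
    return best_index, best_error
```

Setting a weight to zero removes the corresponding element from the comparison, which is how a weighting function can steer the search toward the perceptually important part of the spectrum.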
Decoding Spectral Magnitude Vectors
Given the index, ism, the quantized spectral magnitude vector, S, is obtained (Figure 4, block 42) by copying the (ism)th vector from the code book, Csm, as follows:
$$S(j) = C_{sm}^{(i_{sm})}(j), \qquad j = 1, \ldots, d_{sm}$$
Computation of Pulse-shape Vectors
Pulse-shape vectors are computed (Figure 4, block 44) using the quantized pitch, p, not the unquantized pitch pu. More particularly, the dsm elements of the vector S are used in obtaining the complex spectrum signal, X = {X(j), j = 0, ..., 2dsm-1}, as follows:
$$\mathrm{Re}(X(0)) = 0.0$$
$$\mathrm{Re}(X(j)) = \mathrm{Re}(X(2d_{sm} - j)) = (p/2)\,S(j), \qquad j = 1, \ldots, d_{sm}$$
$$\mathrm{Re}(X(d_{sm})) = 2\,\mathrm{Re}(X(d_{sm}))$$
Having obtained the complex spectrum signal, X, which is a complex vector of dimension 2dsm, the complex time-domain pulse signal, x, is obtained by taking a 2dsm-point Inverse Fast Fourier Transform (IFFT) of X.
The pulse shape vector, v, is then obtained (Figure 4, block 48) as follows:
$$v(j) = \mathrm{Re}(x(j)), \qquad j = 0, \ldots, \lfloor p/2 \rfloor - 1$$
$$v(p - j) = \mathrm{Re}(x(2d_{sm} - j)), \qquad j = 1, \ldots, p - \lfloor p/2 \rfloor$$
As will be apparent to those skilled in the art in the light of the foregoing disclosure, many alterations and modifications are possible in the practice of this invention without departing from the spirit or scope thereof. For example, as noted above, the weighting values used in interpolation of the spectral magnitude vector (Figure 6) can be obtained in any one of a number of ways well known to persons skilled in the art. The same is true of the weighting function, wsm, used in searching the code book, as described above in the section headed "Encoding Spectral Magnitude Vectors". As a further example, different mapping techniques can be used in the interpolation/decimation processes described above in relation to Figures 5 and 6. Thus, instead of mapping the first element of the source sequence to the first element of the target sequence and the last element of the source sequence to the last element of the target sequence (which may not be very accurate, and may not yield good results for larger values of dsm), one could alternatively compute the frequencies corresponding to the first and the last element of the source sequence and map those source sequence elements to the target sequence elements having the nearest corresponding frequencies. This of course means that choice of an appropriate value for dsm is another source of variation on the algorithm. Accordingly, the scope of the invention is to be construed in accordance with the substance defined by the following claims.