GB2199215A - A stochastic coder - Google Patents

A stochastic coder

Info

Publication number
GB2199215A
Authority
GB
United Kingdom
Prior art keywords
filter
vector
speech
level
speech coder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB08728816A
Other versions
GB8728816D0 (en)
Inventor
Philip George Hammett
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Telecommunications PLC
Original Assignee
British Telecommunications PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Telecommunications PLC filed Critical British Telecommunications PLC
Publication of GB8728816D0
Publication of GB2199215A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10 Determination or coding of the excitation function, the excitation function being a multipulse excitation
    • G10L19/12 Determination or coding of the excitation function, the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • G10L2019/0001 Codebooks
    • G10L2019/0004 Design or structure of the codebook
    • G10L2019/0013 Codebook search algorithms
    • G10L2019/0014 Selection criteria for distances
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/06 Speech or voice analysis techniques, the extracted parameters being correlation coefficients

Abstract

A stochastic speech coder is described in which real speech is received (4), analysed (13 and 14) to determine coefficients for a vocal tract filter (3), and broken down into frames of samples. For each frame of samples stored in a buffer (5), every stochastic sequence stored in a codebook (1) is scaled (8) and filtered (3) before being compared (9) with the stored frame. The mean (11) square (12) error signal so produced is used to select the optimum sequence so that its index number can be transmitted to a decoder. The identity of the vocal tract filter (3) is also transmitted to the decoder. The analysis circuitry (13, 14) uses vector quantization, which both reduces the bit rate required to transmit the identity of the vocal tract filter (3) and simplifies the computation required to select the optimum stochastic sequence.

Description

STOCHASTIC CODER

This invention concerns a stochastic coder. Such coders encode real speech input signals by breaking them down into blocks and generating sets of defining parameters for each block. These parameters are transmitted to a decoder at a much lower bit rate than that at which the original speech signals could be transmitted, and at the decoder a synthetic version of the original speech is generated. High quality synthetic speech can be produced using this technique, which makes it an attractive approach to speech transmission when only a limited bandwidth (bit rate) is available; however, it requires a high degree of computational complexity in the coder if it is to operate in real time.

Speech is typically represented in the frequency domain by a power spectrum as shown in figure 1. It can be described by the interaction of two largely independent functions: the vocal tract function, arising from the acoustic resonances of the vocal tract and giving rise to the spectral envelope; and the voice periodicity (pitch) function, derived, in the case of voiced sounds, from the vibrations of the vocal cords and, in the case of unvoiced sounds, from a current of air forced through constrictions in the vocal tract.
Speech frames can be synthesised from sequences of white Gaussian noise by filtering them through filters whose characteristics introduce the pitch and vocal tract functions. Atal and Schroeder [1] were the first to describe a stochastic linear predictive coder using this principle, and the operation of this coder will now be described with reference to figure 2, along with a decoder with reference to figure 3. A codebook 1 of one thousand and twenty-four different sequences of white Gaussian noise (innovation sequences) is stored at the coder. Each innovation sequence comprises N (say twenty to forty) samples. Two speech-adaptive recursive filters, 2 and 3, are used to introduce the pitch and vocal tract functions respectively. The pitch filter 2 is a single tap filter and uses a predictor with a delay loop of greater than N speech samples. The vocal tract filter 3 may be of tenth to twentieth order and has a shorter memory.
The original 'real' speech input signal is received at terminal 4 and broken down into blocks of speech samples, and each block is analysed by speech analysis circuits 6 and 7 (LPC and pitch filter analysis are well documented and will not be described here) to determine coefficients for the pitch and vocal tract filters appropriate to that block. The coefficients, once determined, are loaded into the filters.
Each block of the input speech signal comprises several (two to twelve) frames and each frame comprises N speech samples. Each frame is stored in turn in a buffer 5, and for each frame every innovation sequence is scaled 8 by a gain factor (g) and then filtered. The resulting one thousand and twenty-four synthetic speech signals are then compared 9 with the original speech frame from the buffer 5, and each resulting error signal is further processed by a linear (perceptual weighting) filter 10, to attenuate those frequencies where the error is perceptually less important and to amplify those frequencies where the error is perceptually more important. The mean square of each weighted error is then calculated, 11 and 12, and the innovation sequence which results in the smallest error, the optimum for that frame of input samples, is selected.
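The exhaustive search described above can be sketched in a few lines of Python. This is a toy illustration only, not the patent's arrangement: it omits the pitch filter and perceptual weighting, uses a tiny codebook and a one-pole filter standing in for the vocal tract filter, and all sizes and values are assumptions.

```python
# Toy sketch of the exhaustive analysis-by-synthesis search: every
# codebook sequence is scaled, filtered and compared with the target
# frame; the entry giving the minimum mean square error is selected.
import random

N = 8                       # samples per frame (the patent uses 20-40)
CODEBOOK_SIZE = 16          # the patent uses 1024
random.seed(0)
codebook = [[random.gauss(0.0, 1.0) for _ in range(N)]
            for _ in range(CODEBOOK_SIZE)]

def synth_filter(seq, a=0.5):
    """One-pole recursive filter standing in for the vocal tract filter."""
    y, prev = [], 0.0
    for s in seq:
        prev = s + a * prev
        y.append(prev)
    return y

def best_sequence(target, gain):
    best_i, best_err = None, float("inf")
    for i, c in enumerate(codebook):
        y = synth_filter([gain * s for s in c])           # scale, then filter
        err = sum((t - v) ** 2 for t, v in zip(target, y)) / N  # mean square
        if err < best_err:
            best_i, best_err = i, err
    return best_i, best_err

# Build a target frame from codebook entry 5 so the search has a known answer
target = synth_filter([2.0 * s for s in codebook[5]])
idx, err = best_sequence(target, gain=2.0)
```

With the target synthesised from entry 5, the search recovers index 5 with zero error; in the real coder the index number (and gain) would then be transmitted.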
At the decoder (figure 3) a codebook of innovation sequences, identical to that stored at the coder, is also stored. The decoder also comprises a scaling amplifier 8 and pitch 2 and vocal tract 3 filters identical to those at the coder.
For each block of speech samples received at the coder, the coefficients determined for the pitch and vocal tract filters are sent to the decoder; and for each frame the index number of the selected innovation sequence and the gain factor g are sent. Typically each block comprises four frames, and the transmission from coder to decoder updates the pitch and vocal tract filter coefficients every 10 msecs, while the innovation sequence index number and gain factor g are updated every 2.5 msecs.
The decoder uses these parameters to reconstruct the optimum synthetic version of the original speech. Thus the bit rate of the transmission has been reduced at the expense of computational complexity in the coder.
The most computationally demanding part of the operation of the coder is the selection of the optimum innovation sequence, and it is desirable to reduce the amount of computation required. The following measures, which do not degrade the performance of the coder, have previously been suggested by other authors.

Firstly, the outputs of the filters 2 and 3 are determined not only from the present innovation sequence but also from any residual filter memories. Since the coder filter memories should have the same content as the decoder filter memories (ie the memory arising from the last selected optimum innovation sequence), this memory must be reloaded into the coder filters before each innovation sequence is tested; however, this process can be simplified, and the filter memories reset to zero, if their effect is subtracted from the speech frame in buffer 5.
This is conveniently achieved by running the filters on memory only, ie with no input signal, and subtracting the result from the speech frame. If the pitch filter has all zeros in its memory then, since its minimum delay is greater than N, it will have no effect in the analysis loop and may be removed - the coefficients must still be determined 6, since this filter cannot be dispensed with in the decoder.

Secondly, the perceptual weighting filter 10 in the arrangement shown must perform one thousand and twenty-four filtering operations for each frame of input speech, and the weighting can be more efficiently performed if this filter is moved to the inputs of subtractor 9. In this position the perceptual weighting is combined into the response of the vocal tract filter 3, where it appears as a constant adjustment factor to the coefficients loaded into the filter by analysis circuit 7; and the weighting may be performed on the input speech once for each frame as it enters the buffer 5 [2]. The resulting coder configuration is illustrated in figure 4.
Consideration of the mathematics of the selection of the optimum innovation sequence illustrates where further simplification has been attempted. Let y_i, i = 1..N, represent the sequence of samples obtained by filtering an innovation sequence c_j, j = 1..N, through the LPC vocal tract filter and the perceptual weighting filter with combined impulse response f_i; then

   y_i = SUM(j = 1..i) f_(i-j+1) c_j    (1)
This can be expressed in matrix notation as

   y = Fc    (2)

where F is an N x N matrix formed from the impulse response f, with the term in the ith row and jth column given by f_(i-j+1) for i >= j, and zero for i < j. The optimum innovation sequence is the one which minimises the mean squared error e given by

   e = ||x - gFc||^2    (3)
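The lower triangular matrix F just described can be sketched directly; this is an illustrative Python fragment with an assumed toy impulse response, showing that multiplying by F reproduces the truncated convolution of equation (1).

```python
# F is lower triangular with F[i][j] = f[i-j] (0-indexed; the description
# uses 1-indexing, f_(i-j+1)), so (Fc)_i is the truncated convolution
# sum over j <= i of f[i-j]*c[j].
N = 4
f = [1.0, 0.5, 0.25, 0.125]          # assumed toy impulse response
F = [[f[i - j] if i >= j else 0.0 for j in range(N)] for i in range(N)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

c = [1.0, 0.0, 2.0, 0.0]             # assumed toy innovation sequence
y = matvec(F, c)                     # y[i] = sum_{j<=i} f[i-j]*c[j]
```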
where g is the gain factor and x is a vector containing a frame of N perceptually weighted speech samples.
Differentiating with respect to g and equating to zero gives the optimum value for g as

   g = (xTFc) / ||Fc||^2    (4)
Substituting (4) into (3) gives the mean square error:

   e = ||x||^2 - (xTFc)^2 / ||Fc||^2    (5)
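The derivation above, finding the optimal gain and then the resulting error, can be checked numerically. The following sketch uses assumed small vectors for x and the filtered codeword Fc, and verifies that the closed-form error agrees with the directly computed squared error at the optimal gain.

```python
# Toy numerical check: g = xT(Fc)/||Fc||^2 minimises e = ||x - g*Fc||^2,
# and the minimum equals ||x||^2 - (xT Fc)^2 / ||Fc||^2.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x  = [1.0, 2.0, 3.0]      # weighted speech frame (assumed values)
fc = [0.5, 1.0, 0.5]      # filtered codeword Fc (assumed values)

g = dot(x, fc) / dot(fc, fc)                          # optimal gain
residual = [a - g * b for a, b in zip(x, fc)]
e_direct  = dot(residual, residual)                   # ||x - g*Fc||^2
e_formula = dot(x, x) - dot(x, fc) ** 2 / dot(fc, fc) # closed form
```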
The computational steps required to arrive at this value of the mean square error are shown in figure 5 (it is these computational steps that are actually performed, not the scaling and filtering previously mentioned). A block of the original speech is used to calculate the vocal tract filter coefficients and hence the impulse response to obtain the matrix F; a weighted frame x is used to calculate the transpose xT and the mean square term ||x||^2; and all these calculations may be performed outside the analysis loop of the coder.
However, the convolution Fc must be performed, in respect of each frame, for every innovation sequence in the codebook, inside the analysis loop, in order for the numerator (xTFc)^2 and denominator ||Fc||^2 to be derived. This convolution dominates the computation of the error, and is so computationally demanding that it is very difficult to implement economically in real time.
Trancoso and Atal [2] propose, in their autocorrelation approach, to simplify the computation by approximating the value of ||Fc||^2. In this approach t = xTF is calculated outside the analysis loop, and only the dot product term (tTc)^2 of the numerator, and the subsequent division and subtraction, remain to be calculated inside the loop, see figure 6. This approach simplifies the computation significantly, so that it could be economically performed in real time, but the approximation of the denominator results in sub-optimal innovation sequences being selected, which degrades the coder's performance.
The main problem in the design of stochastic coders is to reduce the bit rate and the computational complexity while maintaining coding performance in real time.
The bit rate of transmissions from coder to decoder is determined by the number of bits required for the coefficients of the pitch and vocal tract filters, together with the index number and the gain factor. The coefficients of the vocal tract filter require the greatest number of bits, and the efficiency with which they are quantized (after being determined and before transmission) has a substantial effect on the overall bit rate. In the past stochastic coders have been implemented using scalar quantization to code these coefficients. The coefficients are not directly coded; instead the partial correlation coefficients are used. A quantizer with non-uniformly spaced levels is used, or alternatively the partial correlations are transformed to make their probability density more uniform. The precision with which each partial correlation is coded varies from one coefficient to another. In general higher order coefficients need less precision (one or two bits) than the lower order coefficients (four or five bits). For a sixteenth order filter, the sixteen coefficients can be quantized, using the most efficient scalar quantization techniques, in forty bits. This means that up to 2^40 different filters could be defined.
According to the invention there is provided a speech coder comprising: deriving means for deriving, from an input speech signal, a representation of a synthesis filter response; first store means containing a plurality of stored excitation functions; identifying means for identifying that one of the stored excitation functions which, when filtered according to the said synthesis filter response, would produce a signal most closely resembling at least a part of the input speech signal; wherein the deriving means is arranged in operation to select the filter representation from a stored set of such representations; and the identifying means includes second store means having a plurality of locations each containing a predetermined function of a respective combination of one of the excitation functions and the response corresponding to one of the representations of the stored set.
The invention makes use of the fact that applying the known techniques of vector quantization to the coding of the coefficients of the vocal tract filter not only reduces the bit rate required for transmissions from the coder to the decoder, but also, by reducing the number of possible filters that can be defined, enables the computation of the optimum innovation sequence to be simplified.
An embodiment of the invention will now be described, by way of example, with reference to the following drawings, in which:

figure 7 illustrates a coder embodying the invention;
figure 8 illustrates a decoder for use with the coder of figure 7;
figure 9 illustrates the steps required to compute the error function in the coder of figure 7;
figure 10a illustrates vector quantization using a fully searched unstructured codebook;
figure 10b illustrates vector quantization using a binary tree structured codebook; and
figure 10c illustrates vector quantization using a graded tree structured codebook.

Referring to figure 7, the coder shown is similar to that shown in figure 4 and the same reference numerals have been used to designate corresponding elements.
Vector, rather than scalar, quantization is used to code the coefficients of the vocal tract filter 3 prior to transmission to the decoder. This process also replaces the previously separate process of determining the response of the vocal tract filter, so it is conveniently considered as partly contained within element 13, which replaces element 7 in figure 4. Since vector quantization is a known and well documented technique [3] it is not proposed to describe it in detail here. However, the implementation of vector quantization in a stochastic coder, and the advantages arising therefrom, will now be described.
For a sixteenth order vocal tract filter 3 there are sixteen coefficients which can be regarded as a vector in sixteen dimensional vector space. One vector describes one possible filter response. A sixteenth order filter can also be represented by a seventeen dimensional vector made up of the filter tap-coefficient autocorrelation values. These autocorrelation vectors are stored in addition to the coefficient vectors since the process of determining the filter configuration for a particular block of input speech is thereby simplified.
The coder comprises another codebook 14 of stored filter autocorrelation vectors (and coefficient values).
Speech signals received from the input terminal 4 are loaded, say four frames of samples at a time, into a buffer within element 13 (four times as large as buffer 5). The autocorrelation vector of that block of samples is calculated and the table 14 is searched to find the filter autocorrelation vector which corresponds to the filter which would produce the minimum residual energy. The residual energy is, in practice, calculated in respect of each filter response from the filter coefficient autocorrelation vector and the autocorrelation vector of the block of samples (the vector to be quantized); it is the energy that would be left after the speech block had been filtered by the inverse of the filter. The coefficients corresponding to this autocorrelation vector are then taken from the table and loaded into the vocal tract filter 3. An index number identifying the selected filter is subsequently transmitted to the decoder together with the coefficients determined for the pitch filter; the gain factor g and index number of the selected innovation sequence are also sent to the decoder for each frame in the block.
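The residual-energy test sketched above is why storing filter autocorrelation vectors simplifies the search: the residual energy reduces to a dot product between the filter tap-coefficient autocorrelation and the speech autocorrelation. The following toy sketch, with assumed first-order values, verifies that this dot-product form agrees with the direct double sum over the inverse-filter taps.

```python
# Residual energy E = sum_i sum_j a[i]*a[j]*R(|i-j|), where a are the
# inverse-filter taps (leading 1) and R the speech autocorrelation.
# This collapses to ra[0]*R[0] + 2*sum_{k>=1} ra[k]*R[k] with ra the
# tap-coefficient autocorrelation - one dot product per candidate filter.
def tap_autocorrelation(a):
    n = len(a)
    return [sum(a[i] * a[i + k] for i in range(n - k)) for k in range(n)]

def residual_energy(ra, R):
    return ra[0] * R[0] + 2.0 * sum(r * s for r, s in zip(ra[1:], R[1:]))

a = [1.0, -0.9]                 # toy first-order inverse filter (assumed)
R = [2.0, 1.5, 1.0]             # toy speech autocorrelation (assumed)
ra = tap_autocorrelation(a)
E = residual_energy(ra, R)
```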
At the decoder, figure 8, a codebook of filters identical to that at the coder is stored, although preferably the actual coefficient values, rather than the autocorrelation values, are stored so that when an index number is received the selected filter can be readily set up from the store.
When scalar quantization is used the number of possible filter configurations at the coder for any particular order of filter is only limited by the number of bits assigned to each coefficient, eg the 2^40 previously mentioned. When vector quantization techniques are used the situation is different; the filter configurations that are stored are generated from a training sequence of real speech, so unused configurations will not be stored, and several little-used configurations can be approximated by a single entry. The way such vector quantized codebooks are constructed is well documented, eg [4], and will not be discussed in detail here; suffice it to say that it has been found, in subjective listening tests, that a codebook of only one thousand and twenty-four filters will provide synthetic speech of only slightly lower quality than that obtained when scalar quantization is employed, with full retained intelligibility. Only ten bits are required to identify any particular sixteenth order filter from a codebook of one thousand and twenty-four configurations.
In addition to the bit rate advantage of using vector quantization techniques there is also an advantage to be gained in the computation of the optimum innovation sequence. Going back to equation (5), the mean square error is given by:

   e = ||x||^2 - (xTFc)^2 / ||Fc||^2    (5)
It has already been shown that this computation is dominated by the convolution Fc. However, since there is now a finite number (1024) of possible filter configurations, as well as a finite number (1024) of innovation sequences, all possible convolution products can be precomputed and the results stored in a table, thus reducing the convolution computation to a simple look-up process. A table of convolution products (Fc) would, however, require the storage of approximately a million vectors.
Preferably, the mean square values of the convolution products (the scalars ||Fc||^2) are stored, since these require less storage space than the corresponding vectors. Alternatively, the entire denominator term may be stored in the table in reciprocal form, 1/||Fc||^2, since multiplication is easier than division on many of the integrated circuits that are used to perform these calculations (eg DSP chips). Also, advantageously, the scalars are stored in logarithmic rather than linear form, since this allows the smaller scalars to be stored with greater accuracy.
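Precomputing the table of reciprocal denominators can be sketched as follows. All sizes here are assumed toy values (the patent uses a 1024 x 1024 table), and the log-domain storage is shown as described above; in a real implementation the logarithms would be quantized to a fixed word length.

```python
# Sketch: precompute 1/||Fc||^2 for every (filter, codeword) pair so
# the analysis loop needs no convolution, storing the logarithm of the
# reciprocal so that small scalars keep relative precision.
import math
import random

random.seed(1)
N_FILTERS, N_CODES, N = 4, 8, 6      # patent: 1024 filters x 1024 codes

def convolve(f, c):
    """Truncated convolution, ie Fc with F built from impulse response f."""
    return [sum(f[i - j] * c[j] for j in range(i + 1)) for i in range(len(c))]

impulses  = [[random.gauss(0, 1) for _ in range(N)] for _ in range(N_FILTERS)]
codewords = [[random.gauss(0, 1) for _ in range(N)] for _ in range(N_CODES)]

log_inv_den = [[-math.log(sum(v * v for v in convolve(f, c)))
                for c in codewords] for f in impulses]

# At search time one table entry is recovered per candidate:
inv_den = math.exp(log_inv_den[0][0])   # 1/||F c_0||^2 for filter 0
```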
When sixteen bits are available per scalar the entire table would require a megaword of memory. Reducing this to twelve or eight bits reduces the storage requirement with only a small degradation in performance. A table of this size can be stored with a few large capacity PROMs or one or two memory modules; an example of such memory devices is given in reference [5].
The steps required to perform the computation of the error have now been simplified, and an example of how the computation may be performed will be briefly described with reference to figure 9. Four frames of input speech samples are autocorrelated so that the vocal tract filter coefficients (and hence F) can be selected. A perceptually weighted frame of speech samples x from the buffer 5 is taken and the transpose xT and mean square ||x||^2 calculated. The dot product t = xTF may then be calculated (as before). This leaves, inside the analysis loop, only: the dot product term (tTc)^2 of the numerator; the multiplication of this term by the reciprocal of the denominator term, obtained from the table of scalars; and the difference from ||x||^2 to be calculated.
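The simplified inner loop can be sketched in a few lines. All values here are assumed toy numbers: x2 stands for ||x||^2, t for xTF, and inv_den for the table entries 1/||Fc||^2; each candidate then costs one dot product, one multiply and one subtract.

```python
# Sketch of the simplified inner loop: error_i = ||x||^2 - (tTc_i)^2 / ||Fc_i||^2,
# with the reciprocal denominator read from a precomputed table.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x2 = 10.0                                   # ||x||^2, computed once per frame
t = [1.0, -2.0, 0.5]                        # t = xTF, computed once (assumed)
codewords = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
inv_den = [0.5, 0.25]                       # table entries 1/||Fc_i||^2 (assumed)

errors = [x2 - dot(t, c) ** 2 * inv_den[i] for i, c in enumerate(codewords)]
best = min(range(len(errors)), key=errors.__getitem__)
```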
The efficiency of the search method adopted to interrogate the codebook 14 and find the appropriate filter configuration (for the block of input speech) is important in determining the time required to conduct the search. The simplest, but most computationally demanding, method is to find the residual energy using the vector to be quantized and every one of the one thousand and twenty-four filter autocorrelation vectors stored in the codebook 14, see figure 10a. The binary tree structure has been suggested to reduce the amount of computation required, see figure 10b. The codebook in this case is split into levels; the first level contains two vectors, the second four, and so on. The tenth level contains the one thousand and twenty-four filter autocorrelation vectors as in the full search case, and a ten bit index number is transmitted as before. It should be appreciated that the tree structure is only applied to the filter autocorrelation vectors; the sets of filter coefficient values are stored as before, each set corresponding to one of the one thousand and twenty-four autocorrelation vectors on the lowest level. The search procedure is as follows: the residual energy is found using the vector to be quantized and each of the two vectors at the first level, and the first level vector producing the minimum residual energy is selected; this effectively divides the vector space in two. The residual energy is then found using the vector to be quantized and the two appropriate vectors in the second level. This is continued until the final level is reached. The vector selected at the final level represents the quantized version of the input vector, and its index number is transmitted. The codebook search procedure is analogous to the successive approximation method of analogue to digital conversion: at the first level a rough approximation is made by comparing with only two vectors, and the approximation improves at each subsequent level.
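The binary tree search can be sketched generically. This toy example uses a two-level tree of one-dimensional vectors and squared distance in place of the residual-energy measure; the tree layout (children of node i at one level are nodes 2i and 2i+1 at the next) is an assumption for illustration.

```python
# Sketch of the binary-tree codebook search: at each level only the two
# children of the previously chosen node are tested, so a depth-D tree
# costs 2*D distance evaluations instead of 2**D for the full search.
def tree_search(target, levels, distance):
    """levels[d] holds 2**(d+1) vectors; children of node i at level d
    are assumed to be nodes 2*i and 2*i+1 at level d+1."""
    idx = 0
    for d, level in enumerate(levels):
        lo, hi = (0, 1) if d == 0 else (2 * idx, 2 * idx + 1)
        idx = lo if distance(target, level[lo]) <= distance(target, level[hi]) else hi
    return idx  # index into the final (full-resolution) level

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Toy two-level tree over 1-D "vectors" (assumed, not from the patent)
levels = [
    [[-1.0], [1.0]],                       # level 1: two coarse centroids
    [[-1.5], [-0.5], [0.5], [1.5]],        # level 2: four leaves
]
leaf = tree_search([0.4], levels, sqdist)
```

For the target 0.4 the search first picks the positive half of the space, then leaf 0.5, using four distance evaluations rather than comparing against all four leaves plus the internal nodes.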
The advantage is that only twenty calculations are performed, rather than the 2^10 of the full search. An apparent disadvantage is that double the storage space is required for the codebook 14 at the coder when it is stored in binary tree form. However, the storage requirement can be halved (ie so that it is the same as in the full search case) by storing difference vectors; this also reduces the computation. The real disadvantage is that quantization performance may be adversely affected, since the search procedure can lead to a non-optimum vector being selected.
Other codebook structures are possible. The quaternary tree (not shown) has four branches from each node, so five levels are needed for a ten bit codebook.
The residual energy is found using the vector to be quantized and each of the four vectors at the first level, and the vector producing the minimum residual energy is selected. The residual energy is then calculated using the vector to be quantized and the four associated vectors at the second level, and so on until the fifth level is reached. Four calculations at each of the five levels are needed, giving twenty in total. In this case it is not advantageous to store the difference vectors, so the storage requirement is greater than for the full search. However, quantization performance is better than that for the binary tree structure.
There is no particular reason why a codebook tree should have the same number of branches per node at each level. In a preferred embodiment a graded tree structure is used, illustrated in figure 10c. This has two branches per node at the first level, four branches per node at the second, eight at the third and sixteen at the fourth. The motivation for this is twofold: firstly, it is efficient in terms of codebook storage requirement; in this case it requires one thousand and ninety-eight vectors, only just greater than the fully searched codebook. Secondly, quantization performance is improved since more choice is left to the lower levels of the codebook. The disadvantage is that thirty calculations are now required to search the codebook, but this is still much fewer than the full search requires.
Of course, it should be appreciated that the use of graded tree structured codebooks is not restricted to embodiments of the present invention; this type of codebook may advantageously be employed in any vector quantizer, and particularly in any LPC-based speech coder.
Davidson and Gersho [6] have suggested that instead of using innovation sequences comprising white Gaussian noise, innovation vectors containing a number of randomly positioned pulses of random amplitudes separated by zeros should be used. If such "sparse" innovation vectors are used, with, say, only four pulses per innovation vector, the storage space for the stochastic codebook can be considerably reduced, since only the amplitudes and positions of the four pulses have to be stored, instead of an amplitude for every position (twenty, in the case described) in each vector. Sparse innovation vectors may be derived from the white Gaussian innovation sequences previously used by preserving the four samples with the largest amplitudes and setting all the remaining samples to zero.
When the stochastic codebook comprises sparse innovation vectors c, each with four pulses, and the pulse positions and amplitudes are stored as p1, p2, p3, p4 and a1, a2, a3, a4 respectively, then if t[1..N] is an array containing the elements of the vector t (= xTF), the dot product may be efficiently computed as follows:

   tTc = t[p1]·a1 + t[p2]·a2 + t[p3]·a3 + t[p4]·a4;

the number of multiply/add operations in computing the numerator is four (cf twenty, in the case when the stochastic codebook comprises twenty samples of white Gaussian noise).
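The four-term dot product can be sketched directly; the values of t and of the pulse positions and amplitudes below are assumed, and the equivalent dense computation is included to show that both give the same result.

```python
# Sparse dot product: with only four (position, amplitude) pulses per
# codeword, tTc needs four multiply-adds instead of one per sample.
t = [0.5 * k for k in range(20)]             # t = xTF, assumed toy values

positions  = [2, 7, 11, 19]                  # p1..p4 (hypothetical codeword)
amplitudes = [1.0, -2.0, 0.5, 1.5]           # a1..a4

sparse_dot = sum(t[p] * a for p, a in zip(positions, amplitudes))

# Equivalent dense codeword: zeros everywhere except the four pulses
dense = [0.0] * 20
for p, a in zip(positions, amplitudes):
    dense[p] = a
dense_dot = sum(u * v for u, v in zip(t, dense))
```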
In addition to this computational saving, the derivation of the precomputed denominators 1/||Fc||^2 is also simplified. Thus the use of sparse innovation vectors works particularly well when combined with the denominator storage method. Also, it has been found in subjective listening tests that the overall performance of the coder is marginally improved when using sparse innovation vectors having only four pulses per vector, compared to coders using Gaussian innovation sequences. This improvement in performance elegantly counteracts the degradation introduced by vector quantization of the vocal tract filter coefficients.
[1] Schroeder, M.R., and Atal, B.S.: 'Code-excited linear prediction (CELP): high-quality speech at very low bit rates', Proceedings of IEEE international conference on acoustics, speech and signal processing, Tampa, Florida, 1985, pp. 937-940.
[2] Trancoso, I.M., and Atal, B.S.: 'Efficient procedures for finding the optimum innovation in stochastic coders', Proceedings of IEEE international conference on acoustics, speech and signal processing, Tokyo, 1986, pp. 2375-2378.
[3] Buzo, A., Gray, A.H., Gray, R.M., and Markel, J.D.: 'Speech coding based upon vector quantization', IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-28, No. 5, October 1980.
[4] Linde, Y., Buzo, A., and Gray, R.M.: 'An Algorithm for Vector Quantizer Design', IEEE Transactions on Communications, Vol. COM-28, No. 1, January 1980.
[5] 1986 IC MASTER, Hearst Business Communications Inc., pp 3825 and 3853.
t6] Davidson, G. and Gersho, A.: 'Complexity reduction methods for vector excitation coding', Proceedings of International Conference on,Acoustics, Speech and Signal Processing, ICASSP 86, Tokyo.

Claims (17)

Claims
1. A speech coder comprising: deriving means for deriving, from an input speech signal, a representation of a synthesis filter response; first store means containing a plurality of stored excitation functions; identifying means for identifying that one of the stored excitation functions which when filtered according to the said synthesis filter response would produce a signal most closely resembling at least part of the input speech signal; wherein, the deriving means is arranged in operation to select the filter representation from a stored set of such representations; and the identifying means includes second store means having a plurality of locations each containing a predetermined function of a respective combination of one of the excitation functions and the response corresponding to one of the representations of the stored set.
2. A speech coder as claimed in Claim 1 wherein the stored set of filter representations are autocorrelation vectors of the synthesis filter coefficients and the deriving means is arranged in operation to derive the autocorrelation vector of the input speech signal and to search the stored set of filter autocorrelation vectors to select the one which corresponds to the filter which would produce the minimum residual energy.
3. A speech coder as claimed in Claim 2 wherein the deriving means is arranged in operation to select the said filter autocorrelation vector by a process of successive approximation involving: selecting that one of a plurality of first level vectors which corresponds to the filter which would produce the minimum residual energy; then selecting that one of a plurality of second level vectors related to the selected first level vector which corresponds to the filter which would produce the minimum residual energy; and so on until the selection is made from the plurality of filter autocorrelation vectors related to the selected penultimate level vector.
4. A speech coder as claimed in Claim 3 wherein the plurality of vectors at any level related to the selected preceding level vector is greater than the plurality of preceding level vectors.
5. A speech coder as claimed in Claim 4 wherein there are one thousand and twenty-four filter autocorrelation vectors at the fourth level; two first level vectors, four second level vectors related to each first level vector, eight third level vectors related to each second level vector, and sixteen filter autocorrelation vectors related to each third level vector.
6. A speech coder as claimed in Claim 1 in which the identifying means identifies that one of the stored excitation functions which when scaled by a scale factor and filtered according to the said synthesis filter response would produce a signal most closely resembling at least a part of the input speech signal.
7. A speech coder as claimed in Claim 6 in which the scale factor g for each excitation function is equivalent to the value that would be calculated from the following equation:
g = xTFc / ||Fc||2
where F is a matrix representing the said synthesis response; xT is the transpose of a vector representing the said at least part of the input speech signal; and c is a vector representing each excitation function.
8. A speech coder as claimed in Claim 7 in which the identifying means identifies that one of the stored excitation functions which produces the minimum result from the following calculation:
E = ||x||2 - (xTFc)2 / ||Fc||2
where F is a matrix representing the said synthesis filter response; x is a vector representing the said at least part of the input speech signal; and c is a vector representing each excitation function; wherein, F is derived from the selected filter representation; and ||x||2 and xT are derived from x.
9. A speech coder as claimed in Claim 8 wherein the convolution products Fc are obtained from the second store means for use in calculating the denominator term or the numerator term, or both.
10. A speech coder as claimed in Claim 8 wherein the mean square values of the convolution products ||Fc||2 are obtained from the second store means for use in calculating the denominator term.
11. A speech coder as claimed in Claim 8 wherein the reciprocal mean square values of the convolution products 1/||Fc||2 are obtained from the second store means.
12. A speech coder as claimed in Claim 10 or Claim 11 wherein the values obtained from the second store means are each contained at a location in the second store means in logarithmic form.
13. A speech coder as claimed in Claim 1 wherein said input speech signal is a frame of a speech signal comprising N samples, and each excitation function is a sparse vector comprising N samples, a minority of which are non-zero samples.
14. A speech coder as claimed in Claim 13 wherein N is twenty and four of the samples in the sparse vector are non-zero.
15. A speech coder as hereinbefore described with reference to figures 7 and 9 of the accompanying drawings.
16. A speech decoder arranged in operation for use with a speech coder as claimed in any preceding claim comprising: a first store means containing a plurality of stored excitation functions; a second store means containing a plurality of synthesis filter responses; and a synthesis means for selecting, under the influence of an input signal, one of the stored excitation functions and one of the synthesis filter responses and convolving them to create a synthetic output speech signal.
17. A speech decoder as hereinbefore described with reference to figure 8 of the accompanying drawings.
GB08728816A 1986-12-23 1987-12-09 A stochastic coder Pending GB2199215A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB868630820A GB8630820D0 (en) 1986-12-23 1986-12-23 Stochastic coder

Publications (2)

Publication Number Publication Date
GB8728816D0 GB8728816D0 (en) 1988-01-27
GB2199215A true GB2199215A (en) 1988-06-29

Family

ID=10609544

Family Applications (2)

Application Number Title Priority Date Filing Date
GB868630820A Pending GB8630820D0 (en) 1986-12-23 1986-12-23 Stochastic coder
GB08728816A Pending GB2199215A (en) 1986-12-23 1987-12-09 A stochastic coder

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GB868630820A Pending GB8630820D0 (en) 1986-12-23 1986-12-23 Stochastic coder

Country Status (1)

Country Link
GB (2) GB8630820D0 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2235354A (en) * 1989-08-16 1991-02-27 Philips Electronic Associated Speech coding/encoding using celp
EP0443548A2 (en) * 1990-02-22 1991-08-28 Nec Corporation Speech coder
EP0443548A3 (en) * 1990-02-22 1991-12-27 Nec Corporation Speech coder
US5208862A (en) * 1990-02-22 1993-05-04 Nec Corporation Speech coder
FR2731548A1 (en) * 1995-03-10 1996-09-13 Univ Sherbrooke Depth-first algebraic-codebook search for fast coding of speech
EP0766232A2 (en) * 1995-09-27 1997-04-02 Nec Corporation Speech coding apparatus
EP0766232A3 (en) * 1995-09-27 1998-06-03 Nec Corporation Speech coding apparatus

Also Published As

Publication number Publication date
GB8630820D0 (en) 1987-02-04
GB8728816D0 (en) 1988-01-27

Similar Documents

Publication Publication Date Title
US4868867A (en) Vector excitation speech or audio coder for transmission or storage
EP0443548B1 (en) Speech coder
EP0504627B1 (en) Speech parameter coding method and apparatus
US5675702A (en) Multi-segment vector quantizer for a speech coder suitable for use in a radiotelephone
KR100872538B1 (en) Vector quantizing device for lpc parameters
JP3481251B2 (en) Algebraic code excitation linear predictive speech coding method.
EP0307122B1 (en) Speech coding
US5323486A (en) Speech coding system having codebook storing differential vectors between each two adjoining code vectors
JP3114197B2 (en) Voice parameter coding method
EP0514912A2 (en) Speech coding and decoding methods
US20020072904A1 (en) Noise feedback coding method and system for efficiently searching vector quantization codevectors used for coding a speech signal
KR100194775B1 (en) Vector quantizer
US5598504A (en) Speech coding system to reduce distortion through signal overlap
US20050114123A1 (en) Speech processing system and method
US6397176B1 (en) Fixed codebook structure including sub-codebooks
US7337110B2 (en) Structured VSELP codebook for low complexity search
Davidson et al. Multiple-stage vector excitation coding of speech waveforms
US6751585B2 (en) Speech coder for high quality at low bit rates
GB2199215A (en) A stochastic coder
EP0483882B1 (en) Speech parameter encoding method capable of transmitting a spectrum parameter with a reduced number of bits
JP3194930B2 (en) Audio coding device
JP3252285B2 (en) Audio band signal encoding method
JPH07168596A (en) Voice recognizing device
EP0755047B1 (en) Speech parameter encoding method capable of transmitting a spectrum parameter at a reduced number of bits
JP3192051B2 (en) Audio coding device