AU675322B2 - Use of an auditory model to improve quality or lower the bit rate of speech synthesis systems - Google Patents

Use of an auditory model to improve quality or lower the bit rate of speech synthesis systems

Info

Publication number
AU675322B2
AU675322B2 AU66720/94A AU6672094A
Authority
AU
Australia
Prior art keywords
speech
model
excitation
masking
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU66720/94A
Other versions
AU6672094A (en
Inventor
Warwick Harvey Holmes
Dipanjan Sen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisearch Ltd
Original Assignee
Unisearch Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AUPM5067A external-priority patent/AUPM506794A0/en
Application filed by Unisearch Ltd filed Critical Unisearch Ltd
Publication of AU6672094A publication Critical patent/AU6672094A/en
Application granted granted Critical
Publication of AU675322B2 publication Critical patent/AU675322B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10 Determination or coding of the excitation function, the excitation function being a multipulse excitation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/12 Determination or coding of the excitation function, the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L2019/0001 Codebooks
    • G10L2019/0013 Codebook search algorithms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Reduction Or Emphasis Of Bandwidth Of Signals (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Description

USE OF AN AUDITORY MODEL TO IMPROVE QUALITY OR LOWER THE BIT RATE OF SPEECH SYNTHESIS SYSTEMS

Field of the Invention

The present invention relates to speech synthesis systems and, in particular, discloses a system based upon an auditory model.
Background of the Invention

Modern speech coding algorithms, such as Code-Excited Linear Prediction (CELP) and Multiband Excitation (MBE), have exploited properties of the human articulatory process. These schemes use parameters such as pitch, vocal-tract filter coefficients and voiced-unvoiced decisions to encode speech at low bit rates. Fig. 1A illustrates a practical CELP method 1 of obtaining synthesised speech from digital speech by exciting a vocal tract filter, such as a weighted formant filter 2, with vectors chosen from a fixed codebook 3 and an adaptive codebook 4. This method has become the dominant technique in present day low bit rate speech coders. Representations of speech using the codebook indices and vocal-tract filter coefficients have achieved high coding gain.
A mean square error criterion 6 is used to determine the error between the weighted input digital speech, obtained via a weighting filter 5, and the weighted synthesised speech in order to make selections from the codebooks 3, 4. The stochastic codebook 3 includes a large number, typically about 1,000, of random signals each of which represents between 5 and 30 milliseconds of a sampled speech signal. The adaptive codebook 4 models the periodic components of speech and typically holds approximately 0.1 seconds of speech, parts of which may be accessed by up to about 200 vectors.
Typically, selections from the stochastic codebook 3 and adaptive codebook 4 are chosen which minimize the mean square error between the weighted input digital speech and the weighted synthesized speech. This is performed for each frame, typically between 5 and 30 milliseconds, of sampled speech. A zero input response (ZIR) filter 7 is used to compensate for framing effects between speech segments.
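For illustration, the following minimal sketch (names and structure are assumptions, not from the patent) shows the analysis-by-synthesis selection just described: each candidate excitation is passed through an all-pole LP synthesis filter 1/A(z) and the entry giving the smallest mean square error against the target speech segment is retained. A full coder would search the adaptive and stochastic codebooks sequentially.

```python
import numpy as np
from scipy.signal import lfilter

def best_codevector(codebook, lp_coeffs, target):
    """codebook: (K, N) candidate excitation vectors; lp_coeffs: [1, a1, ..., a10],
    the denominator of the all-pole synthesis filter 1/A(z); target: N speech samples."""
    best_k, best_err, best_gain = -1, np.inf, 0.0
    for k, c in enumerate(codebook):
        synth = lfilter([1.0], lp_coeffs, c)               # excite 1/A(z)
        gain = (synth @ target) / (synth @ synth + 1e-12)  # optimal scalar gain
        err = np.sum((target - gain * synth) ** 2)         # mean square error
        if err < best_err:
            best_k, best_err, best_gain = k, err, gain
    return best_k, best_gain
```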
While these techniques have resulted in excellent coding gain, the quality of the synthesized speech is often far from the transparent quality achieved in current high fidelity audio coding systems such as the MASCAM system. Apart from the extra bandwidth available in audio coding, the quality of the reproduced sound can be attributed to the modelling of the human auditory system in these systems. These schemes dynamically quantize the samples across the spectrum such that the quantization noise is not perceived. The threshold of perception is computed by modelling the frequency selectivity and masking properties of the human cochlea.
Auditory models have not been used in low bit rate speech coding, possibly because of doubts about the achievable coding gain and the computational overheads.
Also, the maintenance of transparent quality has not been a consideration in most algorithms, as distortion is always present. It has however long been recognised that to improve the perceptual quality of these speech coders, which are essentially voice production models, it is necessary to consider the psychoacoustic properties of the human ear. The only technique so far used in speech coding which allows in any way for the properties of the hearing process is to shape the error signal spectrum such that noise levels tend to be generally less than the signal levels at all frequencies, even in formant valleys, in the manner shown in Fig. 1B. This is performed using the weighting filter of Fig. 1A. This scheme does not explicitly evaluate auditory characteristics; it relies on the higher levels of the signal to lessen the effect of the noise and is not optimally matched to the hearing process. The noise spectrum using CELP and the weighting filter of Fig. 1A is shown in Fig. 1C.
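A common form of such a weighting filter (an assumption about the general technique, not the patent's filter 5) is W(z) = A(z/g1)/A(z/g2), obtained by bandwidth-expanding the LP polynomial; it deemphasises the error near formant peaks so that noise is pushed slightly into formant valleys, as Fig. 1B describes.

```python
import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(signal, lp_coeffs, gamma1=0.9, gamma2=0.6):
    """Apply W(z) = A(z/gamma1)/A(z/gamma2): each LP coefficient a_i
    becomes a_i * gamma^i (bandwidth expansion)."""
    i = np.arange(len(lp_coeffs))
    num = lp_coeffs * gamma1 ** i
    den = lp_coeffs * gamma2 ** i
    return lfilter(num, den, signal)
```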
Summary of the Invention

It is an object of the present invention to substantially overcome, or ameliorate, some or all of the abovementioned problems.
In accordance with one aspect of the present invention there is disclosed a method of determining an optimum excitation in a speech synthesis system, said method comprising the steps of: (a) analysing an input speech signal on the basis of an auditory model to identify perceptually significant components of said input speech signal; and (b) selecting from a plurality of candidate excitation signals to the system, an excitation signal which results in a system output optimally matched to said input speech signal in the perceptually significant components of step (a).

In accordance with another aspect of the present invention there is disclosed a method of determining an optimum codebook entry (or "vector") in a code-book excited linear prediction speech system, the method comprising the steps of: (a) using a perceptual hearing model of digitised speech to determine those perceptible sections of the speech spectra; (b) passing a vector chosen from a stochastic codebook through a linear prediction filter to produce synthesised speech; (c) selecting a codebook entry which minimises an error between the digitised and synthesised speech only in the regions of the spectrum found to be perceptible in step (a); and (d) post-filtering out those regions of the synthesised speech spectrum found to be imperceptible according to the hearing model.
In accordance with another aspect of the present invention there is disclosed a method of determining an optimum codebook entry in a multiband excitation speech coding system, the method comprising the steps of: (a) using a perceptual hearing model of digitized speech to determine the perceptible sections of the speech spectra; (b) passing periodic or noise excitation signals through a multiplicity of bandpass filters to produce synthesized speech; (c) selecting the parameters of such excitation signals to minimize an error between the digitised and synthesised speech only in the regions of the spectrum found to be perceptible in step (a); and (d) post-filtering out those regions of the synthesised speech spectrum found to be imperceptible according to the hearing model, and using the remaining codebook entries and linear prediction filter parameters in said system.
The significance of the above is the use, in some embodiments, of the perceptual model of hearing to isolate the perceptually important sections of the speech spectrum in order to determine the optimum stochastic codebook entries in CELP speech systems (coders).
In the preferred embodiment additional coding gain is achieved by analysing the perceptual content of each sample in the spectrum. An algorithm is disclosed which is able to introduce selective distortion that is a direct function of human hearing perception and is thus optimally matched to the hearing process. It will be shown that good coding gain can be obtained with excellent speech quality. The algorithm may be used on its own, or incorporated into traditional speech coders. For example, the weighting filter in CELP can be replaced by the auditory model, which enables the search for the optimum stochastic code vector in the psychoacoustic domain. It can also be implemented with very little computational overhead and low coding delay.
Brief Description of the Drawings

A number of embodiments of the present invention will now be described with reference to the drawings in which:

Figs. 1A-1C show CELP and weighted noise arrangements of prior art systems;
Fig. 2 illustrates auditory modelling based on the resonance in the Basilar Membrane of the inner ear;
Fig. 3 shows the Bark (or Critical Band) Scale;
Fig. 4 shows an example of a masking threshold;
Fig. 5 shows the simplified masking threshold due to the component v;
Fig. 6 shows the absolute threshold of hearing;
Fig. 7 shows a sound level excess diagram;
Fig. 8 illustrates a speech spectrum and a raised masking threshold;
Fig. 9 shows an arrangement used for auditory model processing in embodiments of the invention;
Fig. 10 illustrates an arrangement for a stochastic code vector search;
Fig. 11 illustrates a first embodiment of the present invention;
Fig. 12 shows the noise spectrum for the embodiment of Fig. 11;
Fig. 13 illustrates the error calculation for a Noise Above Masking embodiment of the present invention;
Fig. 14 shows the noise spectrum of the embodiment of Fig. 13; and
Fig. 15 illustrates a device constructed in accordance with the preferred embodiment.
Detailed Description of the Best and Other Modes of Carrying Out the Invention

The Auditory Masking Model

Auditory models attempt to emulate the signal processing carried out by the human ear. Of particular interest is the function of the basilar membrane, the corresponding transduction process of the hair cells in the inner ear, and innervation of the auditory nerve fibres.
Examples of such models are described in: Zwicker, E. and Zwicker, U.T., "Audio Engineering and Psychoacoustics: Matching Signals to the Final Receiver, the Human Auditory System", J. Audio Eng. Soc., vol. 39, no. 3, March 1991, pp. 115-125 (the "Zwicker Model"); and Allen, "Cochlear Modelling", IEEE ASSP Mag., vol. 2, no. 1, January 1985, pp. 3-29 (the "Allen Model").
This process causes hearing perception to be a function of the frequency of the sound, with frequency sensitivity depending on the position along the basilar membrane according to the place theory, as illustrated in Fig. 2. The critical bands are a direct result of the way sound is processed in the inner ear. Sound pressure is transmitted via the ear canal, the malleus, incus and stapes, and finally to the fluid chambers of the cochlea. This results in a travelling wave along the basilar membrane. The movement of the membrane causes shearing in the hair cells attached to it, thus leading to the hearing sensation. The travelling wave discards high frequency components as it moves towards the helicotrema.
The basilar membrane thus performs an instantaneous transform into the frequency domain. Observations have shown that the frequency selectivity of the basilar membrane, hair cells and nerve fibres follows the critical band scale (Bark scale) in Figure 3. The critical bands are approximately equally spaced along the basilar membrane.
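In implementations, a closed-form fit to the Bark scale of Figure 3 is often convenient; Zwicker's approximation below is one such fit (the patent itself reads the critical band rate from Figure 3, so this formula is an assumption).

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker's closed-form approximation to the critical band rate (Bark)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)
```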
As hearing perception is directly related to the deformation of the basilar membrane, caused by different frequency components, the critical bands are typically intimately related to a number of important psychoacoustic phenomena, including the perceived loudness of sounds.
Of particular importance is the fact that sound components at particular frequencies can reduce, or even totally suppress, the perceptual effects of other sound components at neighbouring frequencies, at least partly because of interaction effects along the basilar membrane. This phenomenon is called auditory masking. Using the travelling wave theory, it can be seen that the low frequency components which extend over most of the membrane are able to mask the high frequency components, since the high frequencies predominate in the early portions of the membrane. However, this is not to say that high frequency components do not mask the lower components, but only that the effect is stronger in the former case.
As an example, Figure 4 displays the minimum pressure level required for isolated frequency components in the speech frequency range to be just perceptible in the presence of a 60 dB tone at 1000 Hz. It is seen that the 1000 Hz tone masks neighbouring frequencies. This tone is referred to as a masker in this context, and the minimum perceptible level of other tones is called the corresponding masking threshold.
Masking can be much more complex than this simple example. In general, each frequency component of a signal tends to mask other components of the same signal.
Masking can be classified into two types: Inter-Bark masking, due to the interaction of frequency components across the whole frequency range, and Intra-Bark masking, due to the frequency components within each Bark band. (The latter type of masking accounts for the observed asymmetry of masking between tones and noise.) Investigations into the masking effect of spectral samples on higher frequencies have led to the conclusion that there is a direct relationship between the level of the component and the amount of masking it induces on the spectrum. Thus, consider the masking effect of the vth signal component at frequency $f_v$ Hz and with sound pressure level $L_v$. The fundamental simplified model used in the preferred embodiment is described by Terhardt, "Calculating Virtual Pitch", Hearing Research, 1979, pp. 155-199 ("Terhardt's Model") and assumes that the slope of the masking threshold contribution due to this component on frequencies higher than itself is given by

$$s_v = -24 - \frac{230}{f_v} + 0.2\,L_v \quad \text{dB/Bark.} \qquad (1)$$

It has been found that the level of the masking signal is not so important when computing the masking effect on lower frequencies. Masking of lower frequencies is hence modelled using the level-independent relationship

$$s_v = 27 \quad \text{dB/Bark.} \qquad (2)$$

This is illustrated in Figure 5. The masking threshold at the uth frequency index due to all the components in the spectrum is then computed as

$$Th(u) = 20 \log_{10} \sum_v 10^{\left(L_v - s_v (z_v - z_u)\right)/20} \qquad (3)$$

where $s_v$ is given by Equation (1) or (2), and $z_v$ is the critical band rate from Figure 3.
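A minimal sketch of Equations (1) to (3) follows, assuming component levels in dB SPL, frequencies strictly above 0 Hz, and the Bark conversion from the earlier sketch; excluding each component's masking of itself is an implementation assumption.

```python
import numpy as np

def hz_to_bark(f):
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def masking_threshold(levels_db, freqs_hz):
    """Th(u) per Equation (3): sum the contributions of all other components v,
    using the slope of Eq. (1) towards higher frequencies and Eq. (2) towards lower."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)   # assumed > 0 Hz
    z = hz_to_bark(freqs_hz)                       # critical band rate z_v
    th = np.empty(len(levels_db))
    for u in range(len(levels_db)):
        total = 0.0
        for v in range(len(levels_db)):
            if v == u:
                continue
            if z[u] >= z[v]:   # masking of higher frequencies, Eq. (1)
                s = -24.0 - 230.0 / freqs_hz[v] + 0.2 * levels_db[v]
            else:              # masking of lower frequencies, Eq. (2)
                s = 27.0
            total += 10.0 ** ((levels_db[v] - s * (z[v] - z[u])) / 20.0)
        th[u] = 20.0 * np.log10(total + 1e-30)     # Eq. (3)
    return th
```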
The asymmetry of masking between tones and noise within each critical band, as described in Hellman, "Asymmetry of Masking Between Noise and Tone", Perception and Psychophysics, vol. 11, no. 3, 1972 ("Hellman's Model"), is also allowed for in the auditory model used in the preferred embodiment. If there is a tonal component at the frequency index at which the masking effect is being computed, the noise in the critical band depends on the spectral intensities away from the immediate neighbourhood of that index. This noise is modelled here by adding the spectral intensities of all samples in the critical band, except for the three samples directly neighbouring the index under consideration.
The absolute threshold of hearing is the level below which no sound is perceived by the ear. Terhardt's experimental results are used to incorporate the absolute threshold into the auditory model as illustrated in Fig. 6.
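Terhardt's analytic fit to the absolute threshold curve of Fig. 6 is commonly written as below; taking the final threshold as the larger of the masking threshold and this curve is an implementation assumption.

```python
import numpy as np

def absolute_threshold_db(f_hz):
    """Terhardt's fit to the absolute threshold of hearing, in dB SPL."""
    f = np.asarray(f_hz, dtype=float) / 1000.0   # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```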
Other auditory models that can be used include:

Ghitza, "Auditory Nerve Representation for Speech Analysis/Synthesis", IEEE Trans. on ASSP, vol. 35, no. 6, pp. 736-740, 1987 ("Ghitza's model");
Lyon, "A Computational Model of Filtering, Detection and Compression in the Cochlea", Proc. IEEE ICASSP, pp. 1281-1285, 1982 ("Lyon's model");
Seneff, "A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing", J. Phonetics, vol. 16, pp. 55-76, 1988 ("Seneff's model"); and
Johnston, "Transform Coding of Audio Signals Using Perceptual Noise Criteria", IEEE J. on Selected Areas in Communications, vol. 6, no. 2, pp. 314-323, February 1988 ("Johnston's model").
Derivatives of those models can also be used, as can non-simultaneous masking models.
Preferred Implementation of the Auditory Model

In the first step of the implementation, input speech samples are divided into the critical frequency bands. The filter-bank used for this purpose is realized using Time Domain Aliasing Cancellation (TDAC), as described in Princen, J.P. and Bradley, A.B., "Analysis/Synthesis Filter Bank Design Based on Time-Domain Aliasing Cancellation", IEEE Trans. on ASSP, vol. 34, no. 5, pp. 1153-1161, 1986. This involves the analysis of 32 ms speech frames with 50% overlap, so that each frame contains 16 ms of the samples from the previous frame. This frame length was chosen with pseudo-stationarity in mind.
The overlapping of the frames results in minimal blocking effects at the synthesis stage, and also reduces the effects of frame discontinuities on the spectral estimates. The frames are multiplied by a sine window which satisfies the criterion for perfect reconstruction.
This window also has a frequency response with reasonably low leakage into neighbouring samples (i.e., high frequency selectivity). Low leakage is extremely important for accurate psychoacoustic modelling. Alternate Modified Discrete Cosine Transforms (MDCT) and Modified Discrete Sine Transforms (MDST) are then used to transform the data into the frequency domain. The following derivation shows that the MDCT (and similarly the MDST) can be computed using the FFT, thus making them computationally efficient.
The MDCT is given by

$$X_k = \sum_{r=0}^{P-1} x(r)\, h(P-1-r) \cos\!\left(\frac{2\pi}{P}\left(k+\tfrac{1}{2}\right)(r+n_0)\right)$$
$$= \mathrm{Re}\left\{ \exp\!\left(\frac{j2\pi}{P}\left(k+\tfrac{1}{2}\right)n_0\right) \sum_{r=0}^{P-1} x(r)\, h(P-1-r) \exp\!\left(\frac{j2\pi}{P}\left(k+\tfrac{1}{2}\right)r\right) \right\}$$
$$= \mathrm{Re}\left\{ \exp\!\left(\frac{j2\pi}{P}\left(k+\tfrac{1}{2}\right)n_0\right) \mathrm{DFT}\left\{ x(r)\, h(P-1-r) \right\} \right\} \qquad (4)$$

where x(r) are the speech samples, h(r) is the window, and $X_k$ is the transformed component.
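The identity of Equation (4) translates directly into code: a pre-twiddle by e^{jπr/P} turns the half-integer-frequency sum into a plain DFT/FFT. The sketch below illustrates this (the function name, the sine window and the choice of n0 are assumptions; a TDAC filter bank would additionally retain only the critically sampled half of the coefficients).

```python
import numpy as np

def mdct_via_fft(x, h, n0):
    """Equation (4): the inner sum over r is computed with a single FFT."""
    P = len(x)
    r = np.arange(P)
    k = np.arange(P)
    u = x * h[::-1] * np.exp(1j * np.pi * r / P)    # pre-twiddle e^{j*2pi*(1/2)*r/P}
    inner = np.fft.ifft(u) * P                      # sum_r u(r) e^{+j*2pi*k*r/P}
    post = np.exp(1j * 2.0 * np.pi * (k + 0.5) * n0 / P)
    return np.real(post * inner)

# Example: a 32 ms frame at 8 kHz is P = 256 samples, with a sine window.
P = 256
h = np.sin(np.pi * (np.arange(P) + 0.5) / P)        # satisfies perfect reconstruction
X = mdct_via_fft(np.random.randn(P), h, n0=(P / 2 + 1) / 2)  # common TDAC phase choice
```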
An additional advantage of using the TDAC transform for the filter bank computation is the maintenance of critical sampling, even though the analysis frames are overlapped by 50%. Following the transformation into the frequency domain, the samples are grouped according to the critical bands. The masking threshold is then computed according to the foregoing theory.
Coding Gain Achieved Using the Auditory Model Alone

Once the masking threshold is calculated, it can be used in a number of different ways in speech coders. Its use in CELP coders is discussed later, but first a general approach which can be used to generate new types of speech coders is discussed.
In principle, spectral samples below the masking threshold need not be transmitted. This may not in itself result in any substantial coding gain, but ensures transparent fidelity of the synthesized sound. In the synthesis process at the receiver, all the samples that lie above the threshold would be used to reconstruct the spectrum.
To obtain more coding gain at the cost of a slight loss of quality, the masking threshold may be increased by 3-5 dB across the spectrum. This procedure adds distortion in a controlled manner which is directly related to human auditory perception, as the distortion is based on a psychoacoustic model. That is, the distortion spectrum is optimized with respect to the human ear.
Figure 7 shows an example of the sound level excess, which is the difference between the spectrum and the masking threshold. Any sample below the zero line shown therefore lies below the auditory threshold and can be discarded on transmission. This results in a punctured spectrum with "holes" in the regions considered to be masked.
Figure 8 shows the speech spectrum and the masking threshold raised such that a substantial proportion (on average) of the spectrum is discarded. Listening tests have shown that this amount of the spectrum can be discarded without noticeable loss of speech quality. When the threshold is raised further, and only 10% of the spectrum is used to synthesize the speech, the quality remains very good, but tonal artefacts begin to appear.
New types of speech coders can be based directly on the method presented in this section. However, the masking threshold computation can also be used in conventional speech coders to improve their performance, as illustrated in the following sections.
Use of Auditory Model in CELP Coders (PERCELP)

The foregoing technique of identifying the perceptually important portions of the speech spectrum and the masking threshold also enables the replacement of the weighting filter 5 in a traditional CELP coder (Fig. 1A), as follows. The resulting coders are called Perceptual Code Excited Linear Prediction (PERCELP) coders. These coders are based on the hypothesis that the perceptually important regions of the spectrum should be synthesized with minimal distortion, and the masked regions should remain masked after synthesis. The noise level in the synthesized waveform would thus be very small in the regions above the masking threshold and slightly below the masking threshold in the masked regions.
Discrete samples of the spectrum amplitude which fall below the masking threshold may therefore be replaced by zero values without adding audible distortion.
This may also be viewed from the perspective of rate-distortion theory, where the masking threshold is the maximum allowable distortion.
This is considerably different to the conventional weighting filter approach, in which the noise level tends to be shaped so that it is everywhere just below the spectral envelope. Accordingly, PERCELP coders should produce speech quality significantly better than that obtained when sub-optimal weighting filters are used.
The approach in PERCELP is to quantize the areas of the speech spectrum which lie above the masking threshold with minimal distortion, while quantizing those regions below the masking threshold with as much distortion as the masking threshold will allow.
This is an attempt to synthesize the perceptually significant regions of the spectrum with greater fidelity.
From a traditional weighting filter point of view, this method emphasizes the error in the perceptually significant regions of the spectrum while (almost) "ignoring" errors in other spectral regions. Further, the fact that there is no simple relation between the vocal tract filter, modelled by the linear prediction (LP) filter, and the masking threshold, means that the analysis is best carried out in the frequency/bark domain.
The ideas of PERCELP have been tested by applying them to a speech coder similar to the 4.8 kbit/s Federal Standard 1016: Telecommunications: Analog to Digital Conversion of Radio Voice by 4,800 bit/second CELP, National Communications System, Office of Technology and Standards, Washington DC, 14 February 1991 (FS1016). A frame length of 30 ms with four sub-frames is used, with 60 samples per sub-frame. The coder produces the parameters of 10 LP coefficients per frame, as well as adaptive and stochastic codebook indices and gains for each sub-frame. The parameters are quantized to 142 bits (using the FS1016 quantization tables), resulting in an overall rate of 4.66 kbits/s.
Masking Threshold Analysis

The masking analysis is performed in the frequency domain using the arrangement 20 shown in Fig. 9. The arrangement 20 inputs digital speech 21 to a zero padding and window unit 22, where 60 samples per sub-frame are padded with 68 zeros and windowed (using a Hamming window) before an FFT analysis in an FFT processor 23. The length of the padding is chosen not only to take advantage of the fast algorithm of the FFT but also because of circular convolution that has to be accounted for (due to other parts of the algorithm discussed later).
Once in the frequency domain, a selected auditory model 24 is used to identify the perceptually significant (unmasked) regions of the spectrum. The output of the auditory model 24 is a 64 sample binary array reflecting the level of each spectral sample in relation to the masking threshold. The array may be defined as follows, where level(i) represents the sound pressure level of the signal:

$$mask(i) = \begin{cases} 1, & level(i) > threshold(i) \\ 0, & level(i) \le threshold(i) \end{cases} \quad \text{for } i = 1 \text{ to } 64. \qquad (5)$$

The masking threshold is incremented in Equation (5) by an offset of 10 dB to take advantage of the results presented above.
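A sketch of the Fig. 9 front end and the mask of Equation (5) follows, assuming dB-domain levels and thresholds; the window-then-pad ordering is an assumption.

```python
import numpy as np

def subframe_spectrum(subframe):
    """Fig. 9 front end: 60-sample sub-frame, Hamming window, zero-padded
    with 68 zeros to 128 points; keep the 64 positive-frequency bins."""
    x = np.concatenate([subframe * np.hamming(60), np.zeros(68)])
    return np.fft.fft(x)[:64]

def binary_mask(level_db, threshold_db, offset_db=10.0):
    """Equation (5), with the masking threshold raised by the 10 dB offset."""
    return (np.asarray(level_db) > np.asarray(threshold_db) + offset_db).astype(np.int8)
```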
Short Term Spectral Analysis and Pitch Analysis

A 10th order Linear Prediction (LP) filter is used to model the vocal tract. A total of 34 bits is used to quantize the LP coefficients in the manner specified in FS1016. The filter parameters (coefficients) of the LP filter are chosen to minimize the error between the filter response and the speech spectrum only in the regions of the spectrum found to be perceptible by the perceptual hearing analysis.
The pitch analysis is carried out using the established technique of an adaptive codebook, and requires the searching of 128 integer and 128 non-integer delays stored in an adaptive array. The final excitation (the sum of the pitch excitation and the stochastic codebook excitation) is then used to update the adaptive array. The pitch gain is quantized each sub-frame using 5-bit non-uniform scalar quantization. The total number of bits required to transmit the pitch information is thus 52 bits per frame.
The pitch analysis is unmodified in PERCELP at present, mainly because of the complexity of transforming each of the integer and non-integer delay vectors into the frequency domain every sub-frame. This would be required, as it is not possible to prestore these frequency domain vectors because the values in the adaptive array are constantly changing.
Frequency Domain Stochastic Codebook

The stochastic codebook contribution is often blamed for the inherent noisiness (often described as background buzziness) of CELP coders. Due to this recognized inaccuracy of the stochastic codebook, and the fact that the final level of the noise is determined by its contribution, it seems the most likely candidate for an auditory analysis to achieve a perceptual improvement. In order to perform a selective search across the spectrum using the masking analysis, the stochastic codebook search algorithm has to be performed in the frequency domain.
In order to minimize the distortion only in the unmasked portions of the speech spectrum, the stochastic codebook (of size 512) analysis is performed only in these regions of the spectrum. The optimum code vector is given by the codebook entry which maximizes $M_k$, defined by:

$$M_k = \left( \sum_{i=1}^{64} mask(i)\, Y_k(i)\, T(i) \right)^{2} \Big/ \sum_{i=1}^{64} mask(i)\, Y_k^2(i) \qquad (6)$$

where $Y_k(i)$ is the filtered codeword (obtained by multiplying $h(i)$, the 128-point FFT of the all-pole LP filter impulse response, with $x_k(i)$, the kth codebook entry), $T(i)$ is the target vector given by subtracting the adaptive codebook excitation from the LP residual, and $mask(i)$ is the binary vector from the auditory masking analysis. The corresponding optimum gain is given by:

$$g_k = \sum_{i=1}^{64} mask(i)\, Y_k(i)\, T(i) \Big/ \sum_{i=1}^{64} mask(i)\, Y_k^2(i) \qquad (7)$$

Equations (6) and (7) may easily be derived by minimizing the noise energy only in the unmasked regions of the spectrum. The search for the optimum stochastic code vector is shown diagrammatically in Fig. 10; a sketch of this search is given below, following the description of Fig. 11.

The interfacing of the auditory model to a standard CELP coder is carried out in two steps to form a PERCELP configuration as shown in Fig. 11. Here, a PERCELP coder 50 is shown in which digital speech 52 is input to an adder 54 configured to subtract from the digital speech the output of the zero input response (ZIR) filter 56. The output of the adder 54 is supplied to a further adder 58 which receives the synthesised speech signal Sn from a formant filter 60, and outputs an error signal 76. The error signal 76 is supplied to a minimum squared error search 72, similar to that of the prior art, which is used to make selections from an adaptive codebook 64. The error signal 76 is also supplied to a new error search incorporating the auditory masking 74, corresponding to the arrangement shown in Fig. 10, which is used to make selections from a stochastic codebook 66.
The speech signal 52 is also input to an auditory masking analysis 62 which implements the auditory model in the manner described above. The output of the auditory masking analysis 62 supplies a further input to the new error search 74. The selected outputs from each of the codebooks 64 and 66 are summed in a summer 68 and applied to a post-filter 70, the output of which is returned for updating the adaptive codebook 64, as well as being applied to the formant filter 60. Codebook gain units 78 and 80 are also provided.
Whilst the use of the post-filter 70 immediately before or after the formant filter appears as the simplest way to truncate the synthesized speech spectrum, this results in a non-optimized adaptive codebook update. It was therefore decided to place the postfilter 70 inside the adaptive codebook loop, as shown in Fig. 11. An alternative would have been to place it directly at the output of the stochastic codebook 66, but use in that position leads to subjectively poorer quality.
Quality Of Synthesized Speech using PERCELP Examples of the noise level? using CELP and PERCELP, along with the original speech spectrum and masking thresholds for a particular frame, are shown in Figs. 1C and 12, respectively. As may be observed, the noise level for PERCELP is predominantly under the masking threshold.
It is clear that post-filtering the excitation in Fig. 12 has kept the noise level below the masking threshold in the masked regions of the spectrum, and is therefore perceptually optimum for these regions. Also, the noise energy in the unmasked regions is mostly lower in these regions than in the CELP coder in Fig. 1C, where the noise energy has been minimized across the whole spectrum rather than just the unmasked regions.
The perceived effect of this is a much smoother, more natural sounding synthesized speech. The most noticeable effect is the lack of the background 'buzzin,-s' found in most CELP coders.
However, the noise level in the unmasked regions in Fig. 12 is still above the corresponding speech level in some parts of the spectrum, even though it is significantly lower in other regions of the spectrum. Fig. 12 with equations and illustrates an arrangement termed by the inventors as minimization of Total Unmasked Noise (TUN), in the synthesised speech. Fig. 12 suggests that there may be further scope to perceptually minimize the noise in the unmasked regions, as follows.
SUBSTITUTE SHEET (Rule 2b6 WO 94/25959 ICT/AU94/00221 -12 The required improvement to the results in Fig. 12 is obtained by minimizing only the noise energy which lies above the masking threshold. The minimization is still carried out in the unmasked regions of the spectrum only however the error energy that lies below the masking threshold is now ignored. The rationale for this is that, since the noise energy that lies below the masking level is imperceptible, it may safely be allowed to remain. This technique will tend to distribute the unavoidable unmasked noise uniformly above the masking threshold.
This approach requires the calculation of actual noise levels across the spectrum and therefore increases the computational complexity. The total Noise energy Above the Masking (NAM) threshold energy for code vector k is calculated as follows: 64 2 1 NAMk Ik(i) (lEk 2 (8) i=1 where M(i) is the masking threshold across the spectrum, Ek(i) is the error due to the kth 1, l Ek 2
M
2 (i) code vector, and Ik(i) Ek 2 (i)>M 2 i (9) 0, Ek2(i) M2( The code vector which results in the lowest value of NAM k is chosen from the 512 codebook entries. The foregoing error calculation is depicted in Fig. 13.
The noise level resulting from this modified codebook search algorithm is displayed in Fig. 14. The perceived quality of the synthesized speech is slightly better than that from the previous algorithm (Fig. 12), although the signal-to-noise ratio with PERCELP is lower than with the CELP algorithm. This is because the total noise energy has not being minimized, but only that above the masking threshold.
Real Time Implementation of PERCELP For practical real time implementations of PERCELP, the decoder will not have access to the masking information to calculate the combined excitation. One solution to this problem is to transmit extra information about the masked portions of the spectrum.
This however increases the bit rate, and is only be realistic if the masking analysis were to be carried out once per frame, since it would require about 64 bits an extra 2.1 kbits/s).
A preferable solution is to compute the masking threshold both at the decoder and the encoder based on information that is known to both encoder and decoder. The shortterm spectral envelope (via the LP coefficients) and the adaptive excitation are such information. The masking analysis could be carried out on either the envelope alone, or on the envelope excited by the adaptive excitation.
Quality and Computational Complexity of PERCELP Listening tests of PERCELP applied to a 4.8 kbps speech coder have shown that the perceptual quality of the synthesized speech is significantly better than that of a SUBSTITUTE SHEET (Rule 26) WO 94/259,59 PCT/AU94/00221 -13conventional implementation using a weighting filter. It is more natural and lacks the inherent noise of CELP coders, which is often attributed to a non-optimum choice of stochastic codebook index and gain.
The computational overhead associated with the auditory model is small enough to be included in single-DSP full-duplex implementations of CELP coders at 4.8 kbps. 'ihe computational overhead of the current implementation is due in part to the frequency domain stochastic codebook. Existing techniques which minimize the computation as well as the storage requirements should make this overhead negligible.
Fig. 15 illustrates a co::iguration of a PERCELP coder which can be formed as an application specific integrated circuit (ASIC) 100.
The ASIC 100 includes a PCM module 102 which receives and outputs analog speech 101, generally band 'imited between 300Hz 3300Hz as in telephony systems. A digital sign. processor (DSP) 104 receives digital speech, 8 bits sampled at 8kHz giving 64 kbps, from the PCM Module 102, and is programmed to implement PERCELP coding and decoding as described above using a stochastic codebook initially stored in a ROM 106, but transferred to a RAM 108 to permit high speed access during operation. The RAM 108 also stores the adaptive codebook. The DSP 104 outputs digital speech at 4.8 kbps to a telecommunications channel 100. A programmable logic device (PLD) 112 is used to "glue" or otherwise link the other components of the ASIC 100.
When the present invention is embodied in a multiband excitation (MBE) speech coding system, a perceptual model of digitized speech is used in the manner described above to determine the perceptible sections of the speech spectra. Periodic or noise excitation signals are then passed through a number of bandpass filters which output the synthesized speech. Parameters of the excitation signals are then selected to minimise the error in the same manner as in the previous embodiments. Post-filtering of the synthesized speech spectra can also be used as before.
Accordingly, the speech synthesis system disclosed herein has application to the telecommunication industry and similar industries where digital speech is being conveyed or stored.
The foregoing describes only a number of embodiments of the present invention and modifications, obvious to those skilled in the art, can be made thereto without departing from the scope of the present invention.

Claims (21)

1. A method of determining an optimum excitation in a speech synthesis system, said method comprising the steps of: (a) analysing an input speech signal on the basis of an auditory model to identify perceptually significant components of said input speech signal; and (b) selecting from a plurality of candidate excitation signals to the system, an excitation signal which results in a system output optimally matched to said input speech signal in the perceptually significant components of step (a).
2. A method as claimed in claim 1, wherein said auditory model is a masking model.
3. A method as claimed in claim 1, wherein the masking model is selected from the group consisting of "Zwicker's model", "Terhardt's model", "Hellman's model", "Allen's model", "Ghitza's model", "Lyon's model", "Seneff's model", "Johnston's model", derivatives of same, and non-simultaneous masking models.
4. A method as claimed in claim 1, wherein the speech synthesis system comprises a code-excited linear prediction arrangement.
5. A method as claimed in claim 4, wherein the auditory model is used to select from a plurality of codebooks used in said arrangement an optimum codebook entry and gain which form part of said excitation in said arrangement.
6. A method as claimed in claim 5, wherein the selected codebook entry is selected from any one or more subsets of said codebooks.
7. A method as claimed in claim 1, wherein the speech synthesis system is selected from the group consisting of: a multiband excitation arrangement, a linear prediction arrangement, and an arrangement employing filter plus excitation models of speech.
8. A method as claimed in claim 1, wherein the analysis of said speech signal and the selection of said candidate excitation signals are performed in the frequency domain.
9. A method as claimed in claim 1, wherein the analysis of said speech signal and the selection of said candidate excitation signals are performed in the time domain.
10. A method as claimed in claim 1, wherein the analysis of said speech signal is performed using time domain aliasing cancellation.
11. A method as claimed in claim 1, wherein the auditory model is configured to control criteria by which the optimal matching of the system output and said input speech signal is determined.
12. A method as claimed in claim 10, wherein the total energy of noise components that exceed a masking threshold of said auditory model is passed to said system output (TUN).
13. A method as claimed in claim 10, wherein the partial energy of noise components above a masking threshold of said auditory model is passed to said system output (NAM).
14. A method as claimed in claim 12 or 13, wherein the noise components are weighted across the frequency spectra of said input speech signal.
15. A method as claimed in claim 1, comprising the further step of cancelling from the excitation signal those portions which are determined to be masked by said auditory model.
16. A method of determining an optimum codebook entry in a code-book excited linear prediction (CELP) speech coding system, the method comprising the steps of: (a) using a perceptual hearing model of digitised speech to determine those perceptible sections of the speech spectra; (b) passing a stochastic codebook vector through a linear prediction filter to produce synthesised speech; (c) selecting a codebook vector which minimises an error between the digitised and synthesised speech only in the regions of the spectrum found to be perceptible in step (a); and (d) post-filtering out those regions of the synthesised speech spectrum found to be imperceptible according to the hearing model, and using the remaining codebook entries and linear prediction filter parameters in said system.
17. A method as claimed in claim 16, wherein the parameters of the linear prediction filter are chosen to minimize an error measure between the filter response and the speech spectrum only in the regions of the spectrum found to be perceptible by the perceptual hearing analysis.
18. A method of determining an optimum codebook entry in a multiband excitation speech coding system, the method comprising the steps of: (a) using a perceptual hearing model of digitized speech to determine the perceptible sections of the speech spectra; (b) passing periodic or noise excitation signals through a multiplicity of bandpass filters to produce synthesized speech; (c) selecting the parameters of such excitation signals to minimize an error between the digitised and synthesised speech only in the regions of the spectrum found to be perceptible in step (a); and (d) post-filtering out those regions of the synthesised speech spectrum found to be imperceptible according to the hearing model, and using the remaining codebook entries and linear prediction filter parameters in said system.
19. Apparatus configured to implement the method as claimed in any one of the preceding claims.
20. A method of determining an optimum excitation in a speech synthesis system substantially as described herein with reference to Figs. 2 to 12, or Figs. 2 to 13 and 14 of the drawings.
21. Apparatus for determining an optimum excitation in a speech synthesis system substantially as described herein with reference to Figs. 10, 11, 13 or 15 of the drawings.

DATED this Third Day of December 1996
Unisearch Limited
Patent Attorneys for the Applicant
SPRUSON FERGUSON
AU66720/94A 1993-04-29 1994-04-29 Use of an auditory model to improve quality or lower the bit rate of speech synthesis systems Ceased AU675322B2 (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
AUPL854293 1993-04-29
AUPL8542 1993-04-29
AUPM5067 1994-04-14
AUPM5067A AUPM506794A0 (en) 1994-04-14 1994-04-14 Perceptual enhancement of c.e.l.p. speech coders
PCT/AU1994/000221 WO1994025959A1 (en) 1993-04-29 1994-04-29 Use of an auditory model to improve quality or lower the bit rate of speech synthesis systems

Publications (2)

Publication Number Publication Date
AU6672094A AU6672094A (en) 1994-11-21
AU675322B2 true AU675322B2 (en) 1997-01-30

Family

ID=25644453

Family Applications (1)

Application Number Title Priority Date Filing Date
AU66720/94A Ceased AU675322B2 (en) 1993-04-29 1994-04-29 Use of an auditory model to improve quality or lower the bit rate of speech synthesis systems

Country Status (2)

Country Link
AU (1) AU675322B2 (en)
WO (1) WO1994025959A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07261797A (en) * 1994-03-18 1995-10-13 Mitsubishi Electric Corp Signal encoding device and signal decoding device
GB9512284D0 (en) * 1995-06-16 1995-08-16 Nokia Mobile Phones Ltd Speech Synthesiser
US5845251A (en) * 1996-12-20 1998-12-01 U S West, Inc. Method, system and product for modifying the bandwidth of subband encoded audio data
US5864820A (en) * 1996-12-20 1999-01-26 U S West, Inc. Method, system and product for mixing of encoded audio signals
US6782365B1 (en) 1996-12-20 2004-08-24 Qwest Communications International Inc. Graphic interface system and product for editing encoded audio data
US5864813A (en) * 1996-12-20 1999-01-26 U S West, Inc. Method, system and product for harmonic enhancement of encoded audio signals
US6463405B1 (en) 1996-12-20 2002-10-08 Eliot M. Case Audiophile encoding of digital audio data using 2-bit polarity/magnitude indicator and 8-bit scale factor for each subband
US6477496B1 (en) 1996-12-20 2002-11-05 Eliot M. Case Signal synthesis by decoding subband scale factors from one audio signal and subband samples from different one
US6516299B1 (en) 1996-12-20 2003-02-04 Qwest Communication International, Inc. Method, system and product for modifying the dynamic range of encoded audio signals
US6192335B1 1998-09-01 2001-02-20 Telefonaktiebolaget Lm Ericsson (Publ) Adaptive combining of multi-mode coding for voiced speech and noise-like signals
WO2002023536A2 (en) * 2000-09-15 2002-03-21 Conexant Systems, Inc. Formant emphasis in celp speech coding
JP4711099B2 (en) 2001-06-26 2011-06-29 ソニー株式会社 Transmission device and transmission method, transmission / reception device and transmission / reception method, program, and recording medium
US7043037B2 (en) 2004-01-16 2006-05-09 George Jay Lichtblau Hearing aid having acoustical feedback protection
EP3079151A1 (en) 2015-04-09 2016-10-12 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoder and method for encoding an audio signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5233660A (en) * 1991-09-10 1993-08-03 At&T Bell Laboratories Method and apparatus for low-delay celp speech coding and decoding
US5245662A (en) * 1990-06-18 1993-09-14 Fujitsu Limited Speech coding system
US5321793A (en) * 1992-07-31 1994-06-14 SIP--Societa Italiana per l'Esercizio delle Telecommunicazioni P.A. Low-delay audio signal coder, using analysis-by-synthesis techniques

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2021514C (en) * 1989-09-01 1998-12-15 Yair Shoham Constrained-stochastic-excitation coding
CA2027705C (en) * 1989-10-17 1994-02-15 Masami Akamine Speech coding system utilizing a recursive computation technique for improvement in processing speed
CA2051304C (en) * 1990-09-18 1996-03-05 Tomohiko Taniguchi Speech coding and decoding system
FR2668288B1 (en) * 1990-10-19 1993-01-15 Di Francesco Renaud LOW-THROUGHPUT TRANSMISSION METHOD BY CELP CODING OF A SPEECH SIGNAL AND CORRESPONDING SYSTEM.
US5195168A (en) * 1991-03-15 1993-03-16 Codex Corporation Speech coder and method having spectral interpolation and fast codebook search

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5245662A (en) * 1990-06-18 1993-09-14 Fujitsu Limited Speech coding system
US5233660A (en) * 1991-09-10 1993-08-03 At&T Bell Laboratories Method and apparatus for low-delay celp speech coding and decoding
US5321793A (en) * 1992-07-31 1994-06-14 SIP--Societa Italiana per l'Esercizio delle Telecommunicazioni P.A. Low-delay audio signal coder, using analysis-by-synthesis techniques

Also Published As

Publication number Publication date
AU6672094A (en) 1994-11-21
WO1994025959A1 (en) 1994-11-10

Similar Documents

Publication Publication Date Title
KR100421226B1 (en) Method for linear predictive analysis of an audio-frequency signal, methods for coding and decoding an audiofrequency signal including application thereof
Spanias Speech coding: A tutorial review
US5778335A (en) Method and apparatus for efficient multiband celp wideband speech and music coding and decoding
US5790759A (en) Perceptual noise masking measure based on synthesis filter frequency response
EP0764941B1 (en) Speech signal quantization using human auditory models in predictive coding systems
EP1141946B1 (en) Coded enhancement feature for improved performance in coding communication signals
JP4213243B2 (en) Speech encoding method and apparatus for implementing the method
EP0732686B1 (en) Low-delay code-excited linear-predictive coding of wideband speech at 32kbits/sec
EP0673013B1 (en) Signal encoding and decoding system
EP0764939B1 (en) Synthesis of speech signals in the absence of coded parameters
AU675322B2 (en) Use of an auditory model to improve quality or lower the bit rate of speech synthesis systems
AU8227798A (en) Method and apparatus for speech enhancement in a speech communication system
EP1328923B1 (en) Perceptually improved encoding of acoustic signals
Kroon et al. Predictive coding of speech using analysis-by-synthesis techniques
Almeida et al. Harmonic coding: A low bit-rate, good-quality speech coding technique
US20090018823A1 (en) Speech coding
Sen et al. Perceptual enhancement of CELP speech coders
WO1997031367A1 (en) Multi-stage speech coder with transform coding of prediction residual signals with quantization by auditory models
EP1397655A1 (en) Method and device for coding speech in analysis-by-synthesis speech coders
McElroy et al. Wideband speech coding in 7.2 kbit/s
Bertorello et al. Design of a 4.8/9.6 kbps baseband LPC coder using split-band and vector quantization
Atal Speech coding: recognizing what we do not hear in speech
JPH08160996A (en) Voice encoding device
Viswanathan et al. Medium and low bit rate speech transmission
Viswanathan et al. A harmonic deviations linear prediction vocoder for improved narrowband speech transmission