CA2259374A1 - Speech synthesis system - Google Patents

Speech synthesis system

Info

Publication number
CA2259374A1
CA2259374A1
Authority
CA
Canada
Prior art keywords
frame
voiced
pitch
speech
lpc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002259374A
Other languages
French (fr)
Inventor
Costas Xydeas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Victoria University of Manchester
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB9614209.6A external-priority patent/GB9614209D0/en
Application filed by Individual filed Critical Individual
Publication of CA2259374A1 publication Critical patent/CA2259374A1/en
Abandoned legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/937 Signal energy in various frequency bands
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
  • Reduction Or Emphasis Of Bandwidth Of Signals (AREA)
  • Telephonic Communication Services (AREA)
  • Aerials With Secondary Devices (AREA)
  • Optical Communication System (AREA)

Abstract

A speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.

Description

SPEECH SYNTHESIS SYSTEM

The present invention relates to speech synthesis systems, and in particular to speech coding and synthesis systems which can be used in speech communication systems operating at low bit rates.
Speech can be represented as a waveform the detailed structure of which represents the characteristics of the vocal tract and vocal excitation of the person producing the speech. If a speech communication system is to be capable of providing an adequate perceived quality, the transmitted information must be capable of representing that detailed structure. Most of the power in voiced speech is at relatively low frequencies, for example below 2kHz. Accordingly good quality speech synthesis can be achieved on the basis of speech waveforms that have been low pass filtered to reject higher frequency components. The perceived speech quality is however adversely affected if the frequency range is restricted much below 4kHz.
Many models have been suggested for defining the characteristics of speech. The known models rely upon dividing a speech signal into blocks or frames and deriving parameters to represent the characteristics of the speech within each frame. Those parameters are then quantized and transmitted to a receiver. At the receiver the quantization process is reversed to recover the parameters, and a speech signal is then synthesised on the basis of the recovered parameters.


The common objective of the designers of the known models is to minimise the volume of data which must be transmitted whilst maximising the perceived quality of the speech that can be synthesised from the transmitted data. In some of the models a distinction is made between whether or not a particular frame is "voiced" or "unvoiced". In the case of voiced speech, speech is produced by glottal excitation and as a result has a quasi-periodic structure.
Unvoiced speech is produced by turbulent air flow at a constriction and does not have the "periodic" spectral structure characteristic of voiced speech. Most models seek to take advantage of the fact that voiced speech signals evolve relatively slowly in the context of frames the duration of which is typically 10 to 30msecs. Most models also rely upon quantization schemes intended to minimise the amount of information which must be transmitted without significant loss of perceived quality. As a result of the work done to date it is now possible to produce speech synthesis systems capable of operating at bit rates of only a few thousand bits per second.
One model which has been developed is known as "sinusoidal coding" (R.J. McAulay and T.F. Quatieri, "Low Rate Speech Coding Based on Sinusoidal Coding", Advances in Speech Signal Processing, Editors S. Furui and M.M. Sondhi, Chapter 6, pp. 165-208, Marcel Dekker, New York, 1992). This approach relies upon an FFT analysis of each input frame to produce a magnitude spectrum, estimating the pitch period of the input frame from that spectrum, and defining the amplitudes at the pitch related harmonics, the harmonics being multiples of the fundamental frequency of the frame. An error measure is calculated in the time domain representing the difference between harmonic and aharmonic speech spectra and that error measure is used to define the degree of voicing of the input frame in terms of a frequency value. Thus the parameters used to represent a frame are the pitch period, the magnitude and phase values for each harmonic, and the frequency value. Proposals have been made to operate this system such that phase information is predicted in a coherent way across successive frames.
In another system known as "multiband excitation coding" (D.W. Griffin and J.S. Lim, "Multiband Excitation Vocoder", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 36, pp. 1223-1235, 1988, and Digital Voice Systems Inc, "INMARSAT M Voice Codec, Version 3.0", Voice Coding System Description, Module 1, Appendix 1, August 1991) the amplitude and phase functions are determined in a different way from that employed in sinusoidal coding. The emphasis in this system is placed on dividing a spectrum into bands, for example up to twelve bands, and evaluating the voiced/unvoiced nature of each of these bands. Bands that are classified as unvoiced are synthesised using random signals. Where the difference between the pitch estimates of successive frames is relatively small, linear interpolation is used to define the required amplitudes. The phase function is also defined using linear frequency interpolation but in addition includes a constant displacement which is a random variable and which depends on the number of unvoiced bands present in the short term spectrum of the input signal. The system works in a way to preserve phase continuity between successive frames. When the pitch estimates of successive frames are significantly different, a weighted summation of signals produced from amplitudes and phases derived for successive frames is formed to produce the synthesised signal.
Thus the common ground between the sinusoidal and multiband systems referred to above is that both schemes directly model the input speech signal which is DFT analysed, and both systems are at least partially based on the same fundamental relationship for representing speech to be synthesised. The systems differ however in terms of the way in which amplitudes and phase are estimated and quantized, the way in which different interpolation methods are used to define the necessary phase relationships, and the way in which "randomness" is introduced in the recovered speech.
Various versions of the multiband excitation coding system have been proposed, for example an enhanced multiband excitation speech coder (A. Das and A. Gersho, "Variable-Dimension Spectral Coding of Speech at 2400 bps and below with Phonetic Classification", IEEE Proc. ICASSP-95, pp. 492-495, May 1995) in which input frames are classified into four types, that is noise, unvoiced, fully voiced and mixed voiced, and a variable dimension vector quantization process for spectral magnitude is introduced, the bi-harmonic spectral modelling system (C. Garcia-Mateo, J.L. Alba-Castro and Eduardo R. Banga, "Speech Coding Using Bi-Harmonic Spectral Modelling", Proc. EUSIPCO-94, Edinburgh, Vol. 2, pp. 391-394, September 1994) in which the short term magnitude spectrum is divided into two bands and a separate pitch frequency is calculated for each band, the spectral excitation coding system (V. Cuperman, P. Lupini and B. Bhattacharya, "Spectral Excitation Coding of Speech at 2.4 kb/s", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) which applies sinusoidal based coding in the linear predictive coding (LPC) residual domain where the synthesised residual signal is the summation of pitch harmonic oscillators with appropriate amplitude and phase functions and amplitudes are quantized using a non-square transformation, the band-widened harmonic vocoder (G. Yang, G. Zanellato and H. Leich, "Band Widened Harmonic Vocoder at 2 to 4 kbps", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) in which randomness in the signal is introduced by adding jitter to the amplitude information on a per band basis, pitch synchronous multiband coding (H. Wang, S.N. Koh and P. Sivaprakasapillai, "Pitch Synchronous Multi-Band (PSMB) Speech Coding", IEEE Proc. ICASSP-95, pp. 516-519, Detroit, May 1995) in which a CELP (code-excited linear prediction) based coding scheme is used to encode speech period segments, multiband LPC coding (S. Yeldener, A.M. Kondoz and B.G. Evans, "High Quality Multiband LPC Coding of Speech at 2.4 kbits/s", Electronic Letters, pp. 1287-1289, Vol. 27, No. 14, 4th July 1991) in which a single amplitude value is allocated to each frame to in effect specify a "flat" residual spectrum, and harmonic and noise coding (M. Nishiguchi and J. Matsumoto, "Harmonic and Noise Coding of LPC Residuals With Classified Vector Quantization", IEEE Proc. ICASSP-95, pp. 484-487, Detroit, May 1995) with classified vector quantization which operates in the LPC residual domain, an input signal being classified as voiced or unvoiced and being full band modelled.
A further type of coding system exists, that is the prototype interpolation coding system. This relies upon the use of pitch period segments or prototypes which are spaced apart in time and reiteration/interpolation techniques to synthesise the signal between two prototypes. Such a system was described as early as 1971 (J.S. Severwight, "Interpolation Reiterations Techniques for Efficient Speech Transmission", Ph.D. Thesis, Loughborough University, Department of Electrical Engineering, 1971). More sophisticated systems of the same general class have been described more recently, for example in the paper by W.B. Kleijn, "Continuous Representations in Linear Predictive Coding", Proc. ICASSP-91, pp. 201-204, May 1991. The same author has published a series of related papers. The system employs 20msecs coding frames which are classified as voiced or unvoiced. Unvoiced frames are effectively CELP coded. Pitch prototype segments are defined in adjacent voiced frames, in the LPC residual signal, in a way which ensures maximum alignment (correlation) of the prototypes and defines the prototype so that the main pitch excitation pulse is not near to either of the ends of the prototype. A pitch period in a given frame is considered to be a cycle of an artificial periodic signal from which the prototype for the frame is obtained. The prototypes which have been appropriately selected from adjacent frames are Fourier transformed and the resulting coefficients are coded using a differential vector quantization scheme.
With this scheme, during synthesis of voiced frames, the decoded prototype Fourier representations for adjacent frames are used to reconstruct the missing signal waveform between the two prototype segments using linear interpolation. Thus the residual signal is obtained which is then presented to an LPC synthesis filter the output of which provides the synthesised voiced speech signal. An amount of randomness can be introduced into voiced speech by injecting noise at frequencies larger than 2kHz, the amplitude of the noise increasing with frequency. In addition, the periodicity of synthesised voiced speech is controlled during the quantization of prototype parameters in accordance with a long term signal to change ratio measure that reflects the similarity which exists between the prototypes of adjacent frames in the residual excitation signal.
The known prototype interpolation coding systems rely upon a Fourier Series synthesis equation which involves a linear-with-time interpolation process.
The assumption is that the pitch estimates for successive frames are linearly interpolated to provide a pitch function and an associated instantaneous fundamental frequency. The instantaneous phase used in the cosine and sine terms of the Fourier series synthesis equation is the integral of the instantaneous harmonic frequencies. This synthesis arrangement allows for the linear evolution of the instantaneous pitch and the non-linear evolution of the instantaneous harmonic frequencies.
A development of this system is described by W.B. Kleijn and J. Haagen, "A Speech Coder Based on Decomposition of Characteristic Waveforms", Proc. ICASSP-95, pp. 508-511, Detroit, May 1995. In the described system the Fourier series coefficients are low pass filtered over time, with a cut-off frequency of 20Hz, to provide a "slowly evolving" waveform component for the LPC excitation signal. The difference between this low pass component and the original parameters provides the "rapidly evolving" components of the excitation signal. Periodic voiced excitation signals are mainly represented by the "slowly evolving" component, whereas random unvoiced excitation signals are represented by the "rapidly evolving" component in this dual decomposition of the Fourier series coefficients. This effectively removes the need for treating voiced and unvoiced frames separately. Furthermore, the rate of quantization and transmission of the two components is different. The "slowly evolving" signal is sampled at relatively long intervals of 25msecs, but the parameters are quantized quite accurately on the basis of spectral magnitude information. In contrast, the spectral magnitude of the "rapidly evolving" signal is sampled frequently, every 4msecs, but is quantized less accurately. Phase information is randomised every 2msecs.
Other developments of the prototype interpolation coding system have been proposed. For example one known system operates on 5msec frames, a pitch period being selected for voiced frames and DFT transformed to yield prototype spectral magnitude values. These values are quantized and the quantized values for adjacent frames are linearly interpolated. Phase information is defined in a manner which does not satisfy any frequency restrictions at the interpolation boundaries. This causes problems of discontinuity at frame boundaries. At the receiver the excitation signal is synthesised using a decoded magnitude and estimated phase values, via an inverse DFT process. The resulting signal is filtered by a following LPC synthesis filter. This model is purely periodic during voiced speech, and this is why a very short duration frame is used. Unvoiced speech is CELP coded.
The wide range of speech synthesis models currently being proposed, only some of which are described above, and the range of alternative approaches proposed to implement those models, indicates the interest in such systems and the lack of any consensus as to which system provides the most advantageous performance. It is an object of the present invention to provide an improved low bit rate speech synthesis system.
In known systems in which it is necessary to obtain an estimate of the pitch of a frame of a speech signal, it has been thought necessary, if high quality of synthesised speech is to be achieved, to obtain high resolution non-integer pitch period estimates. This requires complex processes, and it would be highly desirable to reduce the complexity of the pitch estimation process in a manner which did not result in degraded quality.
According to a first aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.
The result of the above system is that an integer pitch period value is obtained. The system avoids undue complexity and may be readily implemented.
Preferably the pitch estimate is defined using an iterative process. A single reference sample may be used, for example centred with respect to the respective frame, or alternatively multiple pitch estimates may be derived for each frame using different reference samples, the multiple pitch estimates being combined to define a combined pitch estimate for the frame. The pitch estimate may be modified by reference to a voiced/unvoiced status and/or pitch estimates of adjacent frames to define a final pitch estimate.
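By way of illustration, the correlation value computation described above can be sketched as follows. This is not code from the patent: the candidate pitch range, the choice of segment lengths and the normalised correlation measure are assumptions made for the example.

```python
import numpy as np

def correlation_function(x, ref, p_min=20, p_max=147):
    """For each candidate pitch P, correlate a segment centred on the
    reference sample with the same-length segment P samples earlier,
    trying more than one segment length and keeping the best score."""
    x = np.asarray(x, dtype=float)
    scores = np.zeros(p_max + 1)
    for p in range(p_min, p_max + 1):
        best = 0.0
        for length in (p, 2 * p):      # variable-length segments (assumed)
            h = length // 2
            lo, hi = ref - h, ref + h
            if lo - p < 0 or hi > len(x):
                continue
            a = x[lo:hi]               # segment centred on the reference
            b = x[lo - p:hi - p]       # segment delayed by P
            denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
            if denom > 0.0:
                best = max(best, float(np.dot(a, b)) / denom)
        scores[p] = best
    return scores
```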

The correlation function may be clipped using a threshold value, remaining peaks being rejected if they are adjacent to larger peaks. Peaks are initially selected and can be rejected if they are smaller than a following peak by more than a predetermined factor, for example smaller than 0.9 times the following peak.
Preferably the pitch estimation procedure is based on a least squares error algorithm. Preferably the algorithm defines the pitch as a number whose multiples best fit the correlation function peak locations. Initial possible pitch values may be limited to integral numbers which are not consecutive, the increment between two successive numbers being proportional to a constant multiplied by the lower of those two numbers.
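The least squares fit just described might look as follows in outline; the grid constant and pitch range are assumptions, and the function simply assigns each peak location to its nearest multiple of a candidate pitch and keeps the candidate with the smallest squared fitting error.

```python
import numpy as np

def candidate_grid(p_min=20, p_max=147, c=0.05):
    """Non-consecutive integer candidates: the increment between two
    successive candidates grows in proportion to the lower of the two."""
    grid, p = [], p_min
    while p <= p_max:
        grid.append(p)
        p += max(1, int(c * p))
    return grid

def ls_pitch(peak_locations, p_min=20, p_max=147):
    """Choose the pitch whose multiples best fit the peak locations
    in the least squares sense (illustrative sketch)."""
    peaks = np.asarray(peak_locations, dtype=float)
    best_p, best_err = None, np.inf
    for p in candidate_grid(p_min, p_max):
        k = np.maximum(1.0, np.round(peaks / p))   # nearest multiple index
        err = float(np.sum((peaks - k * p) ** 2))
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```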
It is well known from the prior art to classify individual frames as voiced or unvoiced and to process those frames in accordance with that classification.
Unfortunately such a simple classification process does not accurately reflect the true characteristics of speech. It is often the case that individual frames are made up of both periodic (voiced) and aperiodic (unvoiced) components. Prior attempts to address this problem have not proved particularly effective.
It is an object of the present invention to provide an improved voiced or unvoiced classification system.
According to a second aspect of the present invention there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed voiced classification which classifies harmonics in the magnitude spectrum of voiced frames as strongly voiced or weakly voiced, wherein a series of samples centred on the middle of the frame are windowed to form a data array which is Fourier transformed to produce a magnitude spectrum, a threshold value is calculated and used to clip the magnitude spectrum, the clipped data is searched to define peaks, the locations of peaks are determined, constraints are applied to define dominant peaks, and harmonics not associated with a dominant peak are classified as weakly voiced.
Peaks may be located using a second order polynomial. The samples may be Hamming windowed. The threshold value may be calculated by identifying the maximum and minimum magnitude spectrum values and defining the threshold as a constant multiplied by the difference between the maximum and minimum values. Peaks may be defined as those values which are greater than the two adjacent values. A peak may be rejected from consideration if neighbouring peaks are of a similar magnitude, e.g. more than 80% of the magnitude, or if there are spectral magnitudes in the same range of greater magnitude. A harmonic may be considered as not being associated with a dominant peak if the difference between two adjacent peaks is greater than a predetermined threshold value.
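A minimal sketch of this peak-picking procedure follows, assuming the constants quoted above (a threshold constant and the 80% neighbour rule); the ±4-bin neighbourhood used to find competing peaks is an invented illustration, not a value taken from the patent.

```python
import numpy as np

def dominant_peaks(mag, c=0.1, neighbour_ratio=0.8, window=4):
    """Clip the magnitude spectrum at a threshold derived from its
    range, pick local maxima, and drop peaks that have a nearby peak
    of comparable size."""
    mag = np.asarray(mag, dtype=float)
    threshold = c * (mag.max() - mag.min())
    clipped = np.where(mag >= threshold, mag, 0.0)
    peaks = [i for i in range(1, len(clipped) - 1)
             if clipped[i] > clipped[i - 1] and clipped[i] > clipped[i + 1]]
    dominant = []
    for i in peaks:
        rivals = [j for j in peaks if j != i and abs(j - i) <= window]
        if any(clipped[j] > neighbour_ratio * clipped[i] for j in rivals):
            continue                    # a similar-sized neighbour: reject
        dominant.append(i)
    return dominant
```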
The spectrum may be divided into bands of fixed width and a strongly/weakly voiced classification assigned for each band. Alternatively the frequency range may be divided into two or more bands of variable width, adjacent bands being separated at a frequency selected by reference to the strongly/weakly voiced classification of harmonics.
Thus the spectrum may be divided into fixed bands, for example fixed bands each of 500Hz, or variable width bands selected in dependence upon the strongly/weakly voiced status of harmonic components of the excitation signal. A strongly/weakly voiced classification is then assigned to each band. The lowest frequency band, e.g. 0-500Hz, may always be regarded as strongly voiced, whereas the highest frequency band, for example 3500Hz to 4000Hz, may always be regarded as weakly voiced. In the event that a current frame is voiced, and the previous frame is unvoiced, other bands within the current frame, e.g. 3000Hz to 3500Hz, may be automatically classified as weakly voiced. Generally the strongly/weakly voiced classification may be determined using a majority decision rule on the strongly/weakly voiced classification of those harmonics which fall within the band in question. If there is no majority, alternate bands may be alternately assigned strongly voiced and weakly voiced classifications.
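The majority decision rule can be illustrated as below, assuming an 8kHz sampling rate and fixed 500Hz bands; the special cases are reduced to forcing the lowest band strongly voiced and the highest band weakly voiced, and ties alternate between the two classes. This is a sketch, not the patent's own procedure.

```python
def classify_bands(harmonic_hz, hv_flags, fs=8000, band_hz=500):
    """Majority vote of the per-harmonic hv flags inside each fixed
    band; ties alternate between the classes, the lowest band is
    forced strongly voiced (1) and the highest weakly voiced (0)."""
    n_bands = (fs // 2) // band_hz
    flags = []
    for b in range(n_bands):
        lo, hi = b * band_hz, (b + 1) * band_hz
        votes = [hv for f, hv in zip(harmonic_hz, hv_flags) if lo <= f < hi]
        ones, zeros = votes.count(1), votes.count(0)
        if ones == zeros:               # no majority: alternate bands
            flags.append(b % 2)
        else:
            flags.append(int(ones > zeros))
    flags[0], flags[-1] = 1, 0
    return flags
```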
Given the classification of a voiced frame such that harmonics are classified as either strongly or weakly voiced, it is necessary to generate an excitation signal to recover the speech signal which takes into account this classification. It is an object of the present invention to provide such a system.
According to a third aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is defined as voiced or unvoiced, each frame is converted into a coded signal including a pitch period value, a frame voiced/unvoiced classification and, for each voiced frame, a mixed voiced spectral band classification which classifies harmonics within spectral bands as either strongly or weakly voiced, and the speech signal is reconstructed by generating an excitation signal in respect of each frame and applying the excitation signal to a filter, wherein for each weakly voiced spectral band, an excitation signal is generated which includes a random component in the form of a function which is dependent upon the respective pitch period value.
Thus for each frame which has a spectral band that is classified as weakly voiced, the excitation signal is represented by a function which includes a first harmonic frequency component, the frequency of which is dependent upon the pitch period value appropriate to that frame, and a second random component which is superimposed upon the first component.
The random component may be introduced by reducing the amplitude of harmonic oscillators assigned the weakly voiced classification, for example by reducing the power of the harmonics by 50%, while disturbing the oscillator frequencies, for example by shifting the oscillators randomly in frequency in the range of 0 to 30 Hz such that the frequency is no longer a multiple of the fundamental frequency, and then adding further random signals. The phase of the oscillators producing random signals may be randomised at pitch intervals.

Thus for a weakly voiced band, some periodicity remains but the power of the periodic component is reduced and then combined with a random component.
In a speech synthesis system in which a speech signal is represented in part by spectral information in the form of harmonic magnitude values, it is possible to process an input speech signal to produce a series of spectral magnitude values and then to use all of those magnitude values at harmonic locations in subsequent processing steps. In many circumstances however at least some of the magnitude values contain little information which is useful in the recovery of the input speech signal. Accordingly when magnitude values are quantized for transmission to a receiver it is sensible to discard magnitude values which contain little useful information.
In one known system an input speech signal is processed to produce an LPC residual signal which in turn is processed to provide harmonic magnitude values, but only a fixed number of those magnitude values is vector quantized for transmission to a receiver. The discarded magnitude values are represented at the receiver as identical constant values. This known system reduces redundancy but is inflexible in that the locations of the fixed number of magnitude values to be quantized are always the same and predetermined on the basis of assumptions that may be inappropriate in particular circumstances.
It is an object of the present invention to provide an improved magnitude value quantization system.
According to a fourth aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, and each voiced frame is converted into a coded signal including a pitch period value, LPC coefficients, and pitch segment spectral magnitude information, wherein the spectral magnitude information is quantized by sampling the LPC short term magnitude spectrum at harmonic frequencies, the locations of the largest spectral samples are determined to identify which of the magnitudes are relatively more important for accurate quantization, and the magnitudes so identified are selected and vector quantized.
Thus rather than relying upon a simple location selection strategy of a fixed number of magnitude values for quantization and transmission, for example the "low part" of the magnitude spectrum, the invention selects only those values which make a significant contribution according to the subjectively important LPC magnitude spectrum, thereby reducing redundancy without compromising quality.
In one arrangement in accordance with the invention a pitch segment of Pn LPC residual samples is obtained, where Pn is the pitch period value of the nth frame, the pitch segment is DFT transformed, the mean value of the resultant spectral magnitudes is calculated, the mean value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
Alternatively, the RMS value of the pitch segment is calculated, the RMS value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
At the receiver, the selected magnitudes are recovered, and each of the other magnitude values is reproduced as a constant value.
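The selection and normalisation steps of this fourth aspect might be sketched as follows. The number of retained magnitudes, and the assumption that the LPC envelope has already been sampled at the harmonic frequencies (one value per harmonic bin), are choices made for the example; the mean is used as the normalisation factor, as in the first arrangement above.

```python
import numpy as np

def select_and_normalise(pitch_segment, lpc_env_at_harmonics, n_keep=8):
    """DFT the pitch segment, pick the harmonics at which the LPC
    short term magnitude spectrum is largest, and normalise the
    selected magnitudes by the mean (sent separately as a gain)."""
    mags = np.abs(np.fft.rfft(pitch_segment))
    env = np.asarray(lpc_env_at_harmonics, dtype=float)[:len(mags)]
    order = np.argsort(env)[::-1][:n_keep]   # most important locations
    norm = mags.mean()                       # quantised and transmitted
    return order, norm, mags[order] / norm   # then vector quantized
```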
Interpolation coding systems which employ a pitch-related synthesis formula to recover speech generally encounter the problem of coding a variable length, pitch dependent spectral amplitude vector. The quantization scheme referred to above in which only the magnitudes of relatively greater importance are quantized avoids this problem by quantizing only a fixed number of magnitude values and setting the rest of the magnitude values to a constant value. Thus at the receiver a fixed length vector can be recovered. Such a solution to the problem however may result in a relatively spectrally flat excitation model which has limitations in providing high recovered speech quality.
In an ideal world output speech quality would be maximised by quantizing the entire shape of the magnitude spectrum, and various approaches have been proposed for coding the entire magnitude spectrum. In one approach, the spectrum is DFT transformed and coded differentially across successive spectra. This and similar coding schemes are rather inefficient however and operate with relatively high bit rates. The introduction of vector quantization allowed for the development of sinusoidal and prototype interpolation systems which operate at lower bit rates, typically around 2.4Kbits/sec.
Two vector quantization methodologies have been reported which quantize a variable size input vector with a fixed size code vector. In a first approach, the input vector is transformed to a fixed size vector which is then conventionally vector quantized. An inverse transform of the quantized fixed size vector yields the recovered quantized vector. Transformation techniques which have been used include linear interpolation, band limited interpolation, all pole modelling and non-square transformation. This approach however produces an overall distortion which is the summation of the vector quantization noise and a component which is introduced by the transformation process. In a second known approach, a variable input vector is directly quantized with a fixed size code vector. This approach is based on selecting only a limited number of elements from each codebook vector to form a distortion measure between a codebook vector and an input vector. Such a quantization approach avoids the transformation distortion of the alternative technique mentioned above and results in an overall distortion that is equal to the vector quantization noise, but this is significant.
It is an object of the present invention to provide an improved variable sized spectral vector quantization scheme.
According to a fifth aspect of the present invention, there is provided a speech synthesis system in which a variable size input vector of coefficients to be transmitted to a receiver for the reconstruction of a speech signal is vector quantized using a codebook defined by vectors of fixed size, the codebook vectors of fixed size are obtained from variable size training vectors and an interpolation technique which is an integral part of the codebook generation process, codebook vectors are compared to the variable sized input vector using the interpolation process, and an index associated with the codebook entry with the smallest difference from the comparison is transmitted, the index being used to address a further codebook at the receiver and thereby derive an associated fixed size codebook vector, and the interpolation process being used to recover from the derived fixed sized codebook vector an approximation of the variable sized input vector.
The invention is applicable in particular to pitch synchronous low bit rate coders of the type described in this document and takes advantage of the underlying principle of such coders which means that the shape of the magnitude spectrum is represented by a relatively small number of equally spaced samples.
Preferably the interpolation process is linear. For an input vector of given dimension, the interpolation process is applied to produce from the codebook vectors a set of vectors of that given dimension. A distortion measure is then derived to compare the interpolated set of vectors and the input vector and the codebook vector which yields the minimum distortion is selected.
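A sketch of the codebook search with linear interpolation follows; the mean squared error distortion measure is an assumption, and in practice the same interpolation would be embedded in codebook training as described above.

```python
import numpy as np

def vq_search(codebook, target):
    """Interpolate each fixed-size codebook vector to the dimension
    of the variable-size input and return the index of the closest."""
    target = np.asarray(target, dtype=float)
    grid = np.linspace(0.0, 1.0, len(target))
    best_index, best_dist = -1, np.inf
    for index, code in enumerate(codebook):
        cgrid = np.linspace(0.0, 1.0, len(code))
        interp = np.interp(grid, cgrid, code)   # fixed -> variable size
        dist = float(np.sum((target - interp) ** 2))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index
```

The decoder holds the same fixed-size codebook: the received index selects a vector, and the identical interpolation expands it back to the required variable dimension.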
Preferably the dimension of the input vectors is reduced by taking into account only the harmonic amplitudes within the input bandwidth range, for example 0 to 3.4kHz. Preferably the remaining amplitudes, i.e. in the region of 3.4kHz to 4kHz, are set to a constant value. Preferably, the constant value is equal to the mean value of the quantized amplitudes.
Amplitude vectors obtained from adjacent residual frames exhibit significant amounts of redundancy which can be removed by means of backward prediction. The backward prediction may be performed on a harmonic basis such that the amplitude value of each harmonic of one frame is predicted from the amplitude value of the same harmonic in the previous frame or frames. A fixed linear predictor may be incorporated in the system, together with mean removal and gain shape quantization processes which operate on a resulting error magnitude vector.
Although the above described variable sized vector quantization scheme provides advantageous characteristics, and in particular provides for good perceived signal quality at a bit rate of for example 2.4Kbits/sec, in some environments a lower bit rate would be highly desirable even at the loss of some quality. It would be possible for example to rely upon a single value representation and quantization strategy on the assumption that the magnitude spectrum of the pitch segment in the residual domain has an approximately flat shape. Unfortunately systems based on this assumption have a rather poor decoded speech quality.
It is an object of the present invention to overcome the above limitation in lower bit rate systems.

According to a sixth aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including an estimated pitch period, an estimate of the energy of a speech segment the duration of which is a function of the estimated pitch period, and LPC filter coefficients defining an LPC spectral envelope, and a speech signal of related power to the power of the input speech signal is reconstructed by generating an excitation signal using spectral amplitudes which are defined from a modified LPC spectral envelope sampled at the harmonic frequencies defined by the pitch period.
Thus, although a single value is used to represent the spectral envelope of the excitation signal, the excitation spectral envelope is shaped according to the LPC spectral envelope. The result is a system which is capable of delivering high quality speech at 1.2Kbits/sec. The invention is based on the observation that some of the speech spectrum resonance and anti-resonance information is also present in the residual magnitude spectrum, since LPC inverse filtering cannot produce a residual signal of absolutely flat magnitude spectrum. As a consequence, the LPC residual signal is itself highly intelligible.
The magnitude values may be obtained by spectrally sampling a modified LPC synthesis filter characteristic at the harmonic locations related to the pitch period. The modified LPC synthesis filter may have reduced feedback gain and a frequency response which consists of equalised resonant peaks, the locations of which are close to the LPC synthesis resonant locations. The value of the feedback gain may be controlled by the performance of the LPC model such that it is for example proportional to the normalised LPC prediction error. The energy of the reproduced speech signal may be equal to the energy of the original speech waveform.
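One plausible reading of the modified synthesis filter is sketched below: a feedback gain g scales the predictor coefficients in the denominator, flattening the resonances while leaving their positions roughly unchanged. The fixed g and the predictor sign convention (A(z) = 1 − Σ ak z^−k) are assumptions of the example; the patent instead makes the gain proportional to the normalised LPC prediction error.

```python
import numpy as np

def excitation_magnitudes(lpc_a, pitch, fs=8000, g=0.7):
    """Sample |1 / (1 - g * sum(ak z^-k))| at the pitch harmonics;
    g < 1 reduces the feedback gain and so equalises the resonant
    peaks without greatly moving them."""
    lpc_a = np.asarray(lpc_a, dtype=float)
    f0 = fs / float(pitch)                  # pitch period given in samples
    k = np.arange(1, len(lpc_a) + 1)
    mags = []
    for j in range(1, int((fs / 2.0) // f0) + 1):
        w = 2.0 * np.pi * j * f0 / fs       # jth harmonic, radians/sample
        denom = 1.0 - g * np.sum(lpc_a * np.exp(-1j * k * w))
        mags.append(1.0 / abs(denom))
    return np.array(mags)
```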
It is well known that in prototype interpolation coding speech synthesis systems there are often substantial similarities between the prototypes of adjacent frames in the residual excitation signals. This has been used in various systems to improve perceived speech quality by ensuring that there is a smooth evolution of the speech signal over time.
It is an object of the present invention to provide an improved speech synthesis system in which the excitation and vocal tract dynamics are substantially preserved in the recovered speech signal.
According to a seventh aspect of the present invention, there is provided a speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including LPC filter coefficients and at least one parameter associated with a pitch segment magnitude, and the speech signal is reconstructed by generating two excitation signals in respect of each frame, each pair of excitation signals comprising a first excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of one frame and a second excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of a second frame which follows and is adjacent to the said one frame, applying the first excitation signal to a first LPC filter the characteristics of which are determined by the LPC filter coefficients of the said one frame and applying the second excitation signal to a second LPC filter the characteristics of which are determined by the LPC filter coefficients of the said second frame, and weighting and combining the outputs of the first and second LPC filters to produce one frame of a synthesised speech signal.
Preferably the first and second excitation signals include the same phase function and different phase contributions from the two LPC filters involved in the above double synthesis process. This reduces the degree of pitch periodicity in the recovered signal. This and the combination of the first and second LPC filter outputs ensures an effective smooth evolution of the speech spectral envelope on a sample by sample basis.
Preferably the outputs of the first and second LPC filters are weighted by half a window function such as a Hamming window such that the magnitude of the output of the first filter is decreasing with time and the magnitude of the output of the second filter is increasing with time.
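The double synthesis and overlap-add weighting can be illustrated as follows, assuming the A(z) = 1 − Σ ak z^−k convention so that the synthesis filter denominator is [1, −a1, ..., −ap], and assuming the two excitation inputs were built with the same phase function as described above. This is a sketch, not the patent's implementation.

```python
import numpy as np
from scipy.signal import lfilter

def double_synthesis(res_n, res_n1, a_n, a_n1):
    """Pass the two excitation signals through the two frames' LPC
    synthesis filters and cross-fade the outputs with half-Hamming
    weights: falling for frame n, rising for frame n+1."""
    den_n = np.concatenate(([1.0], -np.asarray(a_n, dtype=float)))
    den_n1 = np.concatenate(([1.0], -np.asarray(a_n1, dtype=float)))
    x_n = lfilter([1.0], den_n, res_n)      # frame n synthesis filter
    x_n1 = lfilter([1.0], den_n1, res_n1)   # frame n+1 synthesis filter
    m = len(x_n)
    w = np.hamming(2 * m)
    return w[m:] * x_n + w[:m] * x_n1       # decreasing + increasing halves
```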
According to an eighth aspect of the present invention, there is provided a speech coding system which operates on a frame by frame basis, and in which information is transmitted which represents each frame as either voiced or unvoiced and, for each voiced frame, represents that frame by a pitch period value, quantized magnitude spectral information, and LPC filter coefficients, the received pitch period value and magnitude spectral information being used to generate residual signals at the receiver which are applied to LPC speech synthesis filters the characteristics of which are determined by the transmitted filter coefficients, wherein each residual signal is synthesised according to a sinusoidal mixed excitation synthesis process, and a recovered speech signal is derived from the residual signals.

Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a general block diagram of the encoding process in accordance with the present invention;
Figure 2 illustrates the relationship between coding and matrix quantisation frames;
Figure 3 is a general block diagram of the decoding process;
Figure 4 is a block diagram of the excitation synthesis process;
Figure 5 is a schematic diagram of the overlap and add process;
Figure 6 is a schematic diagram of the calculation of an instantaneous scaling factor;
Figure 7 is a block diagram of the overall voiced/unvoiced classification and pitch estimation process;
Figure 8 is a block diagram of the pitch estimation process;
Figure 9 is a schematic diagram of two speech segments which participate in the calculation of a crosscorrelation function value;
Figure 10 is a schematic diagram of speech segments used in the calculation of the crosscorrelation function value;
Figure 11 represents the value allocated to a parameter used in the calculation of the crosscorrelation function value for different delays;
Figure 12 is a block diagram of the process used for calculating the crosscorrelation function and the selection of its peaks;
Figure 13 is a flow chart of a pitch estimation algorithm;
Figure 14 is a flow chart of a procedure used in the pitch estimation process;
Figure 15 is a flow chart of a further procedure used in the pitch estimation process;
Figure 16 is a flow chart of a further procedure used in the pitch estimation process;
Figure 17 is a flow chart of a threshold value selection procedure;
Figure 18 is a flow chart of the voiced/unvoiced classification process;
Figure 19 is a schematic diagram of the voiced/unvoiced classification process with respect to parameters generated during the pitch estimation process;
Figure 20 is a flow chart of the procedure used to determine offset values;
Figure 21 is a flow chart of the pitch estimation algorithm;
Figure 22 is a flow chart of a procedure used to impose constraints on output pitch estimates to ensure smooth evolution of pitch values with time;
Figures 23, 24 and 25 represent different portions of a flow chart of a pitch post processing procedure;
Figure 26 is a general block diagram of the LPC analysis and LPC quantisation process;
Figure 27 is a general flow chart of a strongly or weakly voiced classification process;
Figure 28 is a flow chart of the procedure responsible for the strongly/weakly voiced classification;
Figure 29 represents a speech waveform obtained from a particular speech utterance;
Figure 30 shows frequency tracks obtained for the speech utterance of Figure 29;
Figure 31 shows to a larger scale a portion of Figure 30 and represents the difference between strongly and weakly voiced classifications;
Figure 32 shows a magnitude spectrum of a particular speech segment and the corresponding LPC spectral envelope and the normalised short term magnitude spectra of the corresponding residual segment, an excitation segment obtained using a binary excitation model and an excitation segment obtained using the strongly/weakly voiced model;
Figure 33 is a general block diagram of a system for representing and quantising magnitude information;
Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33;
Figure 35 is a general block diagram of a quantisation process;
Figure 36 is a general block diagram of a differential variable size spectral vector quantiser; and
Figure 37 represents the hierarchical structure of a mean gain shape quantiser.

A system in accordance with the present invention is described below, firstly in general terms and then in greater detail. The system operates on an LPC residual signal on a frame by frame basis.

Speech is synthesised using the following general expression:

s(i) = Σk=0..K Ak(i) cos(φk(i))    (1)

where i is the sampling instant and Ak(i) represents the amplitude value of the kth cosine term cos(φk(i)) (with φk(i) = ωk(i) + θk) as a function of i. In voiced speech K depends on the pitch frequency of the signal.
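In discrete form, equation (1) is simply a sum of amplitude-modulated cosines, as the following fragment indicates (an illustrative rendering only):

```python
import numpy as np

def sinusoid_sum(amps, phases):
    """amps[k, i] and phases[k, i] hold Ak(i) and the running phase
    of the kth cosine term at sampling instant i."""
    return np.sum(np.asarray(amps) * np.cos(np.asarray(phases)), axis=0)
```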

A voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to be handled in different ways. Unvoiced frames are modelled in terms of an RMS value and a random time series. In voiced frames a pitch period estimate is obtained and used to define a pitch segment which is centred at the middle of the frame. Pitch segments from adjacent frames are DFT transformed and only the resulting pitch segment magnitude information is coded and transmitted. Furthermore, pitch segment magnitude samples are classified as strongly or weakly voiced. Thus, the information which is transmitted for every voiced frame is, in addition to the voiced/unvoiced information, the pitch period value, the magnitude spectral information of the pitch segment, the strong/weak voiced classification of the pitch magnitude spectral values, and the LPC filter coefficients.

At the receiver a synthesis process, that includes interpolation, is used to reconstruct the waveform between the middle points of the current (n+1)th and previous nth frames. The basic synthesis equation for the residual signal is:

Res(i) = Σj=0..K MGj cos(phasej(i))    (2)

where MGj are decoded pitch segment magnitude values and phasej(i) is calculated from the integral of the linearly interpolated instantaneous harmonic frequencies ωj(i). K is the largest value of j for which ωj(i) ≤ π.

In the transitions from unvoiced to voiced, the initial phase for each harmonic is set to zero.
Phase continuity is preserved across the boundaries of successive interpolation intervals.

The synthesis process is performed twice however, once using the magnitude spectral values MGjn+1 of the pitch segment derived from the current (n+1)th frame and again using the magnitude values MGjn of the pitch segment derived in the previous nth frame. The phase function phasej(i) in each case remains the same. The resulting residual signals Resn(i) and Resn+1(i) are used as inputs to corresponding LPC synthesis filters calculated for the nth and (n+1)th speech frames. The two LPC synthesised speech waveforms are then weighted by Wn+1(i) and Wn(i) to yield the recovered speech signal.

Thus the overall synthesis process, for successive voiced frames, can be described by:

S(i) = Wn(i) Σj=0..K Hn(ωjn(i)) MGjn cos[phasejn(i) + φn(ωjn(i))]
     + Wn+1(i) Σj=0..K Hn+1(ωjn(i)) MGjn+1 cos[phasejn(i) + φn+1(ωjn(i))]    (3)

where Hn(ωjn(i)) is the frequency response of the nth frame LPC synthesis filter calculated at the ωjn(i) harmonic frequency function at the ith instant, and φn(ωjn(i)) is the associated phase response of this filter. ωjn(i) and phasejn(i) are the frequency and phase functions defined for the sampling instants i, with i covering the middle of the nth frame to the middle of the (n+1)th frame segment. K is the largest value of j for which ωjn(i) ≤ π.

The above speech synthesis process introduces two "phase dispersion" terms, i.e. φn(ωjn(i)) and φn+1(ωjn(i)), which effectively reduce the degree of pitch periodicity in the recovered signal. In addition, this "double synthesis" arrangement followed by an overlap-add process ensures an effective smooth evolution of the speech spectral envelope (LPC) on a sample by sample basis.

The LPC excitation signal is based on a "mixed" excitation model which allows for the appropriate mixing of periodic and random excitation components in voiced frames on a frequency-band basis. This is achieved by operating the system such that the magnitude spectrum of the residual signal is examined, and applying a peak-picking process, near the ωj resonant frequencies, to detect possible dominant spectral peaks. A peak associated with a frequency ωj indicates a high degree of voicing (represented by hvj=1) for that harmonic. The absence of an adjacent spectral peak, on the other hand, indicates a certain degree of randomness (represented by hvj=0). When hvj=1 (to indicate "strong" voicing) the contribution of the jth harmonic to the synthesis process is MGj cos(phasej(i)). However, when hvj=0 (to indicate "weak" voicing) the frequency of the jth harmonic is slightly dithered, its magnitude MGj is reduced to (MGj/√2) and random cosine terms are added symmetrically alongside the jth harmonic ωj. The terms "strong" and "weak" are used in this sense below. The number NRS of these random terms is

NRS = 2 × ⌈ω0 / (4π × (50/fs))⌉    (4)

where ⌈ ⌉ indicates rounding off to the next larger integer value. Furthermore, the NRS random components are spaced at 50 Hz intervals symmetrically about ωj, ωj being located in the middle of such a 50 Hz interval. The amplitudes of the NRS random components are set to (MGj/√(2 × NRS)). Their initial phases are selected randomly from the [−π, +π] region at pitch period intervals.
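A sketch of the weakly voiced harmonic treatment follows. The guard keeping NRS at least 2, and the uniform frequency dither of up to 30 Hz (taken from the third-aspect discussion earlier), are assumptions of the example:

```python
import numpy as np

def weak_harmonic_terms(mg_j, omega_j, omega_0, fs=8000):
    """Return (amplitude, frequency, phase) triples replacing a weakly
    voiced harmonic: a power-halved, dithered harmonic plus NRS random
    cosines in 50 Hz slots placed symmetrically about omega_j."""
    step = 2.0 * np.pi * 50.0 / fs                    # 50 Hz per sample
    nrs = max(2, 2 * int(np.ceil(omega_0 / (4.0 * np.pi * 50.0 / fs))))
    dither = np.random.uniform(0.0, 2.0 * np.pi * 30.0 / fs)
    terms = [(mg_j / np.sqrt(2.0), omega_j + dither,
              np.random.uniform(-np.pi, np.pi))]
    amp = mg_j / np.sqrt(2.0 * nrs)                   # eq. (4) amplitudes
    for m in range(1, nrs // 2 + 1):
        for sign in (-1.0, 1.0):
            w = omega_j + sign * (m - 0.5) * step     # slot midpoints
            terms.append((amp, w, np.random.uniform(-np.pi, np.pi)))
    return terms
```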

The hvj information must be transmitted to be available at the receiver and, in order to reduce the bit rate allocated to hvj, the bandwidth of the input signal is divided into a number of fixed size bands BDk and a "strongly" or "weakly" voiced flag Bhvk is assigned for each band. In a "strongly" voiced band, a highly periodic signal is reproduced. In a "weakly" voiced band, a signal which combines both periodic and aperiodic components is required.
These bands are classified as strongly voiced (Bhvk=1) or weakly voiced (Bhvk=0) using a majority decision rule approach on the hvj classification values of the harmonics ωj contained within each frequency band.

Further restrictions can be imposed on the strongly/weakly voiced profiles resulting from the classification of bands. For example, the first A bands may always be strongly voiced i.e. hvj=1 for BDk with k=1,2,...,A, and A being a variable. The remaining spectral bands can be strongly or weakly voiced.

Figure 1 schematically illustrates processes operated by the system encoder. These processes are referred to in Figure 1 as Processes I to VII and these terms are used throughout this document. Figure 2 represents the relationship between the analysis/coding frame sizes employed. These are M samples per coding frame, e.g. 160 samples per frame, and k frames are analysed in a block, for example k=4. This block size is used for matrix quantisation. A speech signal is input and Processes I, III, IV, VI and VII produce outputs for transmission.

Assuming that the first Matrix Quantisation analysis frame (MQA) of k×M samples is available, each of the k coding frames within the MQA is classified as voiced or unvoiced (Vn) using Process I. A pitch estimation part of Process I provides a pitch period value Pn only when a coding frame is voiced.

Process II operates in parallel on the input speech samples and estimates p LPC filter coefficients a (for example p=10) every L samples (L is a multiple of M i.e. L=m×M, and m may be equal to for example 2). In addition, k/m is an integer and represents the frame dimension of the matrix quantiser employed in Process III. Thus the LPC filter coefficients are quantised using Process III and transmitted. The quantised coefficients â are used to derive a residual signal Rn(i). When an input coding frame is unvoiced, the energy En of the residual obtained for this frame is calculated (Process VII). √En is then quantised and transmitted.

When the nth coding frame is classified as voiced, a segment of Pn residual samples is obtained (Pn is the pitch period value associated with the nth frame). This segment is centred in the middle of the frame. The selected Pn samples are DFT transformed (Process V) to yield ⌈(Pn+1)/2⌉ spectral magnitude values MGin, 0 ≤ i < ⌈(Pn+1)/2⌉, and ⌈(Pn+1)/2⌉ phase values. The phase information is neglected. The magnitude information is coded (using Process VI) and transmitted. In addition a segment of 20 msecs, which is centred in the middle of the nth coding frame, is obtained from the residual signal Rn(i). This is input to Process IV, together with Pn, to provide the strongly/weakly voiced classification parameters hvjn of the harmonics ωjn. Process IV produces quantised Bhv information, which for voiced frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced decision Vn, the pitch period Pn, the quantised LPC coefficients â of the corresponding LPC frame, and the magnitude values MGin. In unvoiced frames only the quantised √En value and the quantised LPC filter coefficients â are transmitted.

Figure 3 schematically illustrates processes operated by the system decoder. In general terms, given the received parameters of the nth coding frame and those of the previous (n-1)th coding frame, the decoder synthesises a speech signal Sn(i) that extends from the middle of the (n-1)th frame to the middle of the nth frame. This synthesis process involves the generation in parallel of two excitation signals Resn(i) and Resn-1(i) which are used to drive two independent LPC synthesis filters 1/An(z) and 1/An-1(z) the coefficients of which are derived from the transmitted quantised coefficients â. The outputs Xn(i) and Xn-1(i) of these synthesis filters are weighted and added to provide a speech segment which is then post filtered to yield the recovered speech Sn(i). The excitation synthesis process used in both paths of Figure 3 is shown in more detail in Figure 4.

The process commences by considering the voiced/unvoiced status V_k, where k is equal to n or n-1 (see Figure 4). When the frame is unvoiced, i.e. V_k=0, a Gaussian random number generator RG(0,1) of zero mean and unit variance provides a time series which is subsequently scaled by the sqrt(E_k) value received for this frame. This is effectively the required:

    Res_k(i) = sqrt(E_k) × RG(0,1)    (5)

signal, which is then presented to the corresponding LPC synthesis filter 1/A_k(z), k=n or n-1. Performance could be increased if the sqrt(E) value were calculated, quantised and transmitted every 5 msecs. Thus, provided that bits are available when coding unvoiced speech, four sqrt(E_μ), μ=0,...,3, values are transmitted for every unvoiced frame of 20 msecs duration (160 samples).

In the case where V_k=1, the Res_k(i) excitation signal is defined as the summation of a "harmonic" Res_k^h(i) component and a "random" Res_k^r(i) component. The top path of the V_k=1 part of the synthesis in Figure 4, which provides the harmonic component of this mixed excitation model, always calculates the instantaneous harmonic frequency function ω_j^n(i), which is associated with the interpolation interval defined between the middle points of the nth and (n-1)th frames (i.e. this action is independent of the value of k). Thus, when decoding the nth frame, ω_j^n(i) is calculated using the pitch frequencies f_j^{1,n}, f_j^{2,n} and linear interpolation, i.e.

    ω_j^n(i) = 2π ((f_j^{1,n} − f_j^{2,n}) / M) i + 2π f_j^{2,n}    (6)

with 0 ≤ j < ⌈(P_max + 1)/2⌉, 0 ≤ i < M and P_max = max[P_n, P_{n-1}]. The frequencies f_j^{1,n} and f_j^{2,n} are defined as follows:
I) When both the nth and (n-1)th coding frames are voiced, i.e. V_n=1 and V_{n-1}=1, the pitch frequencies are estimated as follows:
a) If

    |P_n − P_{n-1}| < 0.2 × (P_n + P_{n-1})    (7)

which means that the pitch values of the nth and (n-1)th coding frames are rather similar, then:

    f_j^{1,n} = j/P_n + (1 − hv_j^n) × RU(−a, +a)    (8)

    f_j^{2,n} = f_j^{1,n-1} + j×b  if (V_{n-1} = V_{n-2} = 1 AND |P_{n-1} − P_{n-2}| > 0.2(P_{n-1} + P_{n-2}))
    f_j^{2,n} = f_j^{1,n-1}        otherwise    (9)

The f_j^{1,n-1} value is calculated during the decoding process of the previous (n-1)th coding frame. hv_j^n is the strongly/weakly voiced classification (0 or 1) of the jth harmonic ω_j^n. P_n and P_{n-1} are the received pitch estimates from the n and n-1 frames. RU(−a, +a) indicates the output of a random number generator with uniform pdf within the −a to +a range (a=0.00375).

b) If

    |P_n − P_{n-1}| > 0.2 × (P_n + P_{n-1})    (10)

then

    f_j^{1,n} = j(1/P_n − b) + (1 − hv_j^n) × RU(−a, +a)    (11)

and f_j^{2,n} = f_j^{1,n-1} + b × j, where b is defined as:
    b = (0.2 × (1/P_n + 1/P_{n-1}) / 2) × sgn(j/P_n − f_j^{1,n-1})    (12)

Notice that in case (b), which applies for significantly different P_n and P_{n-1} pitch estimates, equations 11 and 12 ensure that the rate of change of the ω_j(i) function is restricted to (j × 0.2 × (1/P_n + 1/P_{n-1}))/M.
II) When one of the two coding frames (i.e. n, n-1) is unvoiced, one of the following two definitions is applicable:
a) for V_{n-1}=0 and V_n=1:

    f_j^{2,n} = f_j^{1,n},  0 ≤ j < ⌈(P_n + 1)/2⌉

and f_j^{1,n} is given by Equation (8).
b) for V_{n-1}=1 and V_n=0:

f_j^{2,n} is set to the f_j^{1,n-1} value, which has been calculated during the decoding process of the previous (n-1)th coding frame, and f_j^{1,n} = f_j^{2,n}.

Given ω_j^n(i), the instantaneous phase function phase_j^n(i) is calculated by:

    phase_j^n(i) = 2π ((f_j^{1,n} − f_j^{2,n}) / (2M)) i² + 2π f_j^{2,n} i + phase_j^{n-1}(M)    (13)

for 0 ≤ j < ⌈(P_max + 1)/2⌉ and 0 ≤ i < M.

Furthermore, the "harmonic" component Res_k^h(i) of the residual signal is given by:

    Res_k^h(i) = Σ_{j=0}^{⌈(P_k+1)/2⌉−1} C_j(i) × MG_j^k(hv_j^k) × cos[phase_j^k(i)],  0 ≤ i < M    (14)

where k=n or n-1,

    C_j(i) = 0  if ω_j^k(i) > π
    C_j(i) = 1  if ω_j^k(i) ≤ π

    MG_j^k(hv_j^k) = MG_j^k / 2  for hv_j^k = 0 and 1 ≤ j < ⌈(P_k+1)/2⌉
    MG_j^k(hv_j^k) = MG_j^k      for hv_j^k = 1
    MG_j^k(hv_j^k) = 0           otherwise, including j = 0

and MG_j^k, j=1,...,⌈(P_k+1)/2⌉−1, are the received magnitude values of the kth coding frame, with k=n or k=n-1.
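The following sketch generates this harmonic component for one interpolation interval (Equations 6, 13 and 14). The variable names and the per-frame phase book-keeping are assumptions; the weakly voiced amplitude weighting follows the reconstructed definition above:

    import numpy as np

    def harmonic_excitation(f1, f2, MG, hv, phase_prev, M=160):
        """Harmonic part of the mixed excitation (Eqs. 6, 13, 14).

        f1, f2     : per-harmonic normalised frequencies f_j^{1,n}, f_j^{2,n}
        MG         : decoded spectral magnitudes MG_j (MG[0] is unused DC)
        hv         : strongly(1)/weakly(0) voiced flags per harmonic
        phase_prev : phase_j^{n-1}(M) carried over from the previous frame
        """
        i = np.arange(M)
        res_h = np.zeros(M)
        phase_end = np.zeros(len(f1))
        for j in range(1, len(f1)):
            # Eq. 13: quadratic phase from linearly interpolated frequency (Eq. 6)
            phase = (2 * np.pi * (f1[j] - f2[j]) / (2 * M) * i**2
                     + 2 * np.pi * f2[j] * i + phase_prev[j])
            omega = 2 * np.pi * ((f1[j] - f2[j]) / M * i + f2[j])
            amp = MG[j] if hv[j] == 1 else 0.5 * MG[j]   # weakly voiced: halved
            res_h += np.where(omega <= np.pi, 1.0, 0.0) * amp * np.cos(phase)
            # phase at i = M becomes phase_prev for the next frame
            phase_end[j] = (2 * np.pi * (f1[j] - f2[j]) / (2 * M) * M**2
                            + 2 * np.pi * f2[j] * M + phase_prev[j])
        return res_h, phase_end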

The second path of the V_k=1 case in Figure 4 provides the random excitation component Res_k^r(i). In particular, given the recovered strongly/weakly voiced classification values hv_j^k, the system calculates, for those harmonics with hv_j^k = 0, the number NRS of random sinusoidal components which are used to randomise the corresponding harmonic. This is:

    NRS = ⌈ f_s / (50 × P_k) ⌉    (15)

where f_s is the sampling frequency. Notice that the NRS random sinusoidal components are located symmetrically about the corresponding harmonic ω_j^k and they are spaced 50 Hz apart.

The instantaneous frequency of the qth random component, q=0,1,...,NRS−1, for the jth harmonic ω_j^k is calculated by:

    Ω_j^{k,q}(i) = ω_j^k(i) + 2π × (25/f_s) + (q − (NRS/2)) × 2π × (50/f_s)    (16)

for 0 ≤ j < ⌈(P_k + 1)/2⌉ and 0 ≤ i < M.
The associated phase value is:
    Ph_j^{k,q}(i) = ((Ω_j^{k,q}(M) − Ω_j^{k,q}(0)) / (2M)) i² + Ω_j^{k,q}(0) i + φ_j^q    (17)

for 0 ≤ j < ⌈(P_k + 1)/2⌉ and 0 ≤ i ≤ M

where φ_j^q = RU(−π, +π). In addition, the Ph_j^{k,q}(i) function is randomised at pitch intervals (i.e. when the phase of the fundamental harmonic component is a multiple of 2π, i.e. mod[phase_1^k(i), 2π] = 0).

Given the Ph_j^{k,q}(i), the random excitation component Res_k^r(i) is calculated as follows:

    Res_k^r(i) = Σ_{j=0}^{⌈(P_k+1)/2⌉−1} Σ_{q=0}^{NRS−1} C_j^q(i) × MG_j^{k,q}(hv_j^k) × cos(Ph_j^{k,q}(i)),  0 ≤ i < M    (18)

where

    MG_j^{k,q}(hv_j^k) = MG_j^k / 2  for hv_j^k = 0 and 1 ≤ j < ⌈(P_k+1)/2⌉
    MG_j^{k,q}(hv_j^k) = 0           for hv_j^k = 1
    MG_j^{k,q}(hv_j^k) = 0           otherwise, including j = 0

and

    C_j^q(i) = 0  if Ω_j^{k,q}(i) > π
    C_j^q(i) = 1  otherwise
Thus for V_k=1 voiced coding frames, the mixed excitation residual is formed as:

    Res_k(i) = Res_k^h(i) + Res_k^r(i)    (19)

Notice that when V_k=0, instead of using Equation 5, the random excitation signal Res_k(i) can be generated by the summation of random cosines located 50 Hz apart, where their phase is randomised every λ samples, with λ<M, i.e.

    Res_k(i) = sqrt(E_k / 40) × Σ_{r=0}^{79} cos(2π (50/f_s) r i + φ_{r,κ}),  φ_{r,κ} = RU(−π, +π)    (20)

where κ = 0, 1, 2, ..., 0 ≤ i < M, and λ is defined so as to ensure that the phase of the cos terms is randomised every λ samples across frame boundaries. The resulting Res_n(i) and Res_{n-1}(i) excitation sequences (see Figure 4) are processed by the corresponding 1/A_n(z) and 1/A_{n-1}(z) LPC synthesis filters. When coding the next (n+1)th frame, 1/A_{n-1}(z) becomes 1/A_n(z) (including the memory) and 1/A_n(z) becomes 1/A_{n+1}(z) with the memory of 1/A_n(z). This is valid in all cases except during an unvoiced to voiced transition, where the memory of the 1/A_{n+1}(z) filter is set to zero. The coefficients of the 1/A_n(z) and 1/A_{n-1}(z) synthesis filters are calculated directly from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame size L is equal to M samples. However, when L≠M (usually L>M) linear interpolation is used on the filter coefficients (defined every L samples) so that the transfer function of the synthesis filter is updated every M samples.
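A minimal sketch of this coefficient update, assuming direct linear interpolation between the coefficient sets of consecutive L-sample LPC analysis frames (interpolating in an LSP domain would be an equally valid reading of the text):

    import numpy as np

    def interpolated_lpc(a_prev, a_curr, m):
        """Return m coefficient sets, one per M-sample subframe, linearly
        interpolated between two LPC sets defined L = m*M samples apart."""
        a_prev, a_curr = np.asarray(a_prev), np.asarray(a_curr)
        return [((m - t) * a_prev + t * a_curr) / m for t in range(1, m + 1)]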

The output signals of these filters, denoted as X_{n-1}(i) and X_n(i), are weighted, overlapped and added as schematically illustrated in Figure 5 to yield X_n(i), i.e.:

    X_n(i) = W_{n-1}(i) X_{n-1}(i) + W_n(i) X_n(i)

where

    W_n(i) = 0.54 − 0.46 cos(2π i / (2M − 1))    for 0 ≤ i < M, when V_n = V_{n-1}

    W_n(i) = 0                                    for 0 ≤ i < 0.25M
    W_n(i) = 0.5 − 0.5 cos(2π (i − 0.25M) / M)    for 0.25M ≤ i < 0.75M   when V_n ≠ V_{n-1}
    W_n(i) = 1                                    for 0.75M ≤ i < M
                                                                          (21)
and

    W_{n-1}(i) = 0.54 − 0.46 cos(2π (i + M − 0.5) / (2M − 1))    for 0 ≤ i < M, when V_n = V_{n-1}

    W_{n-1}(i) = 1                                    for 0 ≤ i < 0.25M
    W_{n-1}(i) = 0.5 + 0.5 cos(2π (i − 0.25M) / M)    for 0.25M ≤ i < 0.75M   when V_n ≠ V_{n-1}
    W_{n-1}(i) = 0                                    for 0.75M ≤ i < M
                                                                          (22)

X_n(i) is then filtered via a PF(z) post filter and a high pass filter HP(z) to yield the speech segment S'_n(i). PF(z) is the conventional post filter:

    PF(z) = (A(z/b) / A(z/c)) × (1 − μ z^{-1})    (23)

with b=0.5, c=0.8 and μ = 0.5 k_1^n; k_1^n is the first reflection coefficient of the nth coding frame. HP(z) is defined as:

    HP(z) = (b_1 − c_1 z^{-1}) / (1 − a_1 z^{-1})    (24)

with b_1 = c_1 = 0.9807 and a_1 = 0.961481.
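The two filters can be applied as ordinary IIR sections. The following sketch uses scipy.signal.lfilter and treats the stated constants as given; the γ-weighted polynomial A(z/γ) simply scales coefficient a_i by γ^i:

    import numpy as np
    from scipy.signal import lfilter

    def postfilter(x, a_hat, k1, b=0.5, c=0.8):
        """PF(z) = A(z/b)/A(z/c) * (1 - mu z^-1), then HP(z) (Eqs. 23-24)."""
        p = len(a_hat)
        # A(z) = 1 - sum a_i z^-i  ->  A(z/g): coefficient a_i scaled by g^i
        num = np.concatenate(([1.0], [-a_hat[i] * b**(i + 1) for i in range(p)]))
        den = np.concatenate(([1.0], [-a_hat[i] * c**(i + 1) for i in range(p)]))
        y = lfilter(num, den, x)
        mu = 0.5 * k1
        y = lfilter([1.0, -mu], [1.0], y)                # spectral tilt term
        return lfilter([0.9807, -0.9807], [1.0, -0.961481], y)   # HP(z)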

In order to ensure that the energy of the recovered S(i) signal is preserved, as compared to that of the X(i) sequence, a scaling factor SC is calculated every LPC frame of L samples:

    SC_l = sqrt(E_x / E_s)    (25)

where:

    E_x = Σ_{i=0}^{L−1} X_l(i)²  and  E_s = Σ_{i=0}^{L−1} S'_l(i)²

SC_l is associated with the middle of the lth LPC frame as illustrated in Figure 6. The filtered samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied by SC_l(i) to yield the final output of the system, S_l(i) = SC_l(i) × S'_l(i), where:

    SC_l(i) = SC_{l-1} W_{l-1}(i) + SC_l W_l(i),  0 ≤ i < L    (26)

and

    W_l(i) = 0.5 − 0.5 cos(π i / (L − 1)),  0 ≤ i < L

    W_{l-1}(i) = 0.5 + 0.5 cos(π i / (L − 1)),  0 ≤ i < L

The scaling process introduces an extra half LPC frame delay into the coding-decoding process.
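A sketch of this energy scaling step (Equations 25-26), with names assumed for illustration:

    import numpy as np

    def energy_scale(x_ref, s_filt, sc_prev, L):
        """Smooth per-LPC-frame gain correction (Eqs. 25-26)."""
        e_x = np.sum(x_ref**2)                 # energy before post filtering
        e_s = np.sum(s_filt**2)                # energy after post filtering
        sc = np.sqrt(e_x / max(e_s, 1e-12))    # SC_l, guarded against silence
        i = np.arange(L)
        w_l = 0.5 - 0.5 * np.cos(np.pi * i / (L - 1))    # W_l rises 0 -> 1
        sc_i = sc_prev * (1.0 - w_l) + sc * w_l          # SC_l(i)
        return s_filt * sc_i, sc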

The above described energy scaling procedure operates on an LPC frame basis, in contrast to both the decoding and the PF(z), HP(z) filtering procedures, which operate on the basis of a frame of M samples.

Details of the coding processes represented in Figure 1 will now be described.

Process I derives a voiced/unvoiced (V/UV) classification V_n for the nth input coding frame and also assigns a pitch estimate P_n to the middle sample M_n of this frame. This process is illustrated in Figure 7.

The V/UV and pitch estimation analysis frame is centred at the middle M_{n+1} of the (n+1)th coding frame with 237 samples on either side. The signal x(i) in the above analysis frame is low pass filtered with a cut off frequency f_c = 1.45 kHz and the resulting (−147, 147) samples centred about M_{n+1} are used in a pitch estimation algorithm, which yields an estimate PM_{n+1}. The pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the pitch estimation process. The 294 input samples are used to calculate a crosscorrelation function CR(d), where d is shown in Figure 9 and 20 ≤ d ≤ 147. Figure 9 shows the two speech segments which participate in the calculation of the crosscorrelation function value at delay "d". In particular, for a given value of d, the crosscorrelation function ρ_d(j) is calculated for the segments {x_L}_d, {x_R}_d as:

    ρ_d(j) = Σ_{i=0}^{d−1} (x_L^d(i) − x̄_L)(x_R^d(i) − x̄_R) / sqrt( Σ_{i=0}^{d−1} (x_L^d(i) − x̄_L)² × Σ_{i=0}^{d−1} (x_R^d(i) − x̄_R)² )    (27)

where:

    x_L^d(i) = x(M_{n+1} − d + j + i),  x_R^d(i) = x(M_{n+1} + j + i),  for 0 ≤ i ≤ d−1, j = 0, 1, ..., f(d)

Figure 10 schematically represents the x_L^d and x_R^d speech segments used in the calculation of the value CR(d), and the non linear relationship between d and f(d) is given in Figure 11. x̄_L and x̄_R represent the mean values of the {x_L}_d and {x_R}_d sequences respectively.

The algorithm then selects max_j[ρ_d(j)] and defines CR(d) = max_j[ρ_d(j)], 20 ≤ d ≤ 147. In addition to CR(d), the box in Figure 8 labelled "Calculation of CR function and selection of its peaks", whose detailed diagram is shown in Figure 12, also provides the locations loc(k) of the peaks of the CR(d) function, where k=1,2,...,Np and Np is the number of peaks in a CR(d) function.
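A compact Python rendering of this CR(d) computation (Equation 27); f(d) is assumed here to be supplied as a lookup, since the text only gives it graphically in Figure 11:

    import numpy as np

    def cr_function(x, centre, f_of_d, d_min=20, d_max=147):
        """CR(d) = max_j rho_d(j) over the shifts j = 0..f(d) (Eq. 27)."""
        cr = np.zeros(d_max + 1)
        for d in range(d_min, d_max + 1):
            best = -1.0
            for j in range(f_of_d[d] + 1):
                xl = x[centre - d + j: centre + j]       # left segment, d samples
                xr = x[centre + j: centre + d + j]       # right segment, d samples
                xl = xl - xl.mean()
                xr = xr - xr.mean()
                denom = np.sqrt(np.sum(xl**2) * np.sum(xr**2))
                if denom > 0:
                    best = max(best, float(np.sum(xl * xr) / denom))
            cr[d] = best
        return cr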

Figure 12 is a block diagram of the process involving the calculation of the CR function and the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as:

    th(d) = CR(d_max) − b − (d − d_max) × a − c    (28)

where c = 0.08 when [(V'_n = 1) AND (|d_max − P'_n| < 0.15 × (d_max + P'_n)) OR (V_{n-1} = 1)] AND (d > 0.875 × P'_n) AND (d < 1.125 × P'_n), or c = 0 elsewhere,
and the constants a and b are defined as:

    V_n      1        1        0
    V_{n-1}  1        0        1/0
    b        0.025    0.04     0.05
    a        0.0005   0.0005   0.0006

d_max is equal to the value of d for which CR(d) is maximised to CR_max. Using this threshold the CR(d) function is clipped to CR_L(d), i.e.:
    CR_L(d) = 0      for CR(d) ≤ th(d)
    CR_L(d) = CR(d)  otherwise

CR_L(d) consists of segments G_s, s=1,2,3,..., of positive values separated by G_o runs of zero values. The algorithm examines the length of the G_o runs which exist between successive G_s segments (i.e. G_s and G_{s+1}), and when G_o < 17, the G_s segment with the max CR_L(d) value is kept. This procedure yields CR_L(d), which is then examined by the following "peak picking" procedure. In particular those CR_L(d) values are selected for which:
    CR_L(d) > CR_L(d−1) and CR_L(d) ≥ CR_L(d+1)

However certain peaks can be rejected if:

    CR_L(loc(k)) < CR_L(loc(k+1)) × 0.9

This ensures that the final CR_L(loc(k)), k=1,...,Np, does not contain spurious low level CR_L(d) peaks. The locations d of the above defined CR_L(d) peaks are given by loc(k), k=1,2,...,Np.

CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch Estimation (MHRPE) algorithm shown in Figure 8, whose output is PM_{n+1}. The flowchart of this MHRPE procedure is shown in Figure 13, where P is initialised to 0 and, at the end, the estimated P is the requested PM_{n+1}. In Figure 13 the main pitch estimation procedure is based on a Least Squares Error (LSE) algorithm which is defined as follows:

For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 × j, i.e. j ∈ {21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58, 64, 70, 77, 84, 92, 101, 111, 122, 134} (thus 21 iterations are performed):

1) Form the multiplication factor vector:

    u_j = [(1/j) × loc]

where [·] denotes rounding to the nearest integer.

2) Reject possible pitch j and go back to (1) if:
a) the same element occurs in u_j twice;
b) the elements of u_j have as a common factor a prime number.

3) Form the following error quantity:

    E_j = loc^T loc − 2 p_j u_j^T loc + p_j² u_j^T u_j

where

    p_j = (loc^T u_j) / (u_j^T u_j)

4) Select the p_js value for which the associated error quantity E_js is minimum (i.e. E_js ≤ E_j for all j ∈ {21, 23, ..., 134}). Set P = p_js.
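A Python sketch of this LSE search over the candidate grid; the rejection tests of step 2 are included in a simplified form, and the helper names are illustrative:

    import numpy as np
    from math import gcd
    from functools import reduce

    CANDIDATES = [21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58, 64,
                  70, 77, 84, 92, 101, 111, 122, 134]

    def lse_pitch(loc):
        """Least-squares pitch fit to the correlation peak locations loc(k)."""
        loc = np.asarray(loc, dtype=float)
        best_p, best_e = 0.0, np.inf
        for j in CANDIDATES:
            u = np.rint(loc / j)                    # u_j = [(1/j) loc], rounded
            ui = u.astype(int)
            if 0 in ui or len(set(ui)) != len(ui):  # step 2a: repeated element
                continue
            if len(ui) > 1 and reduce(gcd, ui) > 1: # step 2b: common factor
                continue
            p = float(loc @ u) / float(u @ u)       # p_j
            e = (float(loc @ loc) - 2 * p * float(u @ loc)
                 + p * p * float(u @ u))            # E_j
            if e < best_e:
                best_p, best_e = p, e
        return best_p, best_e                       # P = p_js and E_js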

The next two general conditions, "Reject Highest Delay" (loc(Np)) and "Reject Lowest Delay" (loc(1)), are included in order to reject false pitch "double" or "half" values and in general to provide constraints on the pitch estimates of the system. The "Reject Highest Delay" condition involves 3 conditions:

i) if P = 0 then reject loc(Np).
ii) if loc(Np) > 100 then find the local maximum CR(d_lm) in CR(d) in the vicinity of the estimated pitch P (i.e. 0.8×P to 1.2×P) and compare this with th(d_lm), which is determined as in Equation 28. Reject loc(Np) when CR(d_lm) < th(d_lm) − 0.02.
iii) if the error E_js of the LSE algorithm is larger than 50 and u_js(Np) = Np with Np > 2, then reject loc(Np).

The flowchart of this is given in Figure 14.

The "ReJect Lowest Delay" general condition, whose flowchart is given in Figure 15, rejects loc(l) when the following three constraints are ~imllit~ ously satisfied:
i) The density of detection of the peaks of the correlation coefficient function is less than or e~l to 0.75. i.e.
u" ( Np) ii) If the loc?tion of the first peak is neglected (i,e, Ioc(1)), then the r~m~inin~ locations exhibit a common factor.
iti) The value of the corrolation coefficient function at the locations of the mi.~sin~ peaks is relatively small compared to ad~acent ~ tected peaks, i,e, If Upnk-upn~k)>l, for k=l "..Np. then for i=upn(k)+ I : Upn(k+ l )- I
a) find local maximum CR(d"n) in the range from (i-O l)xloc(l) to (i+O.l)x loc(l ).
b) if CR(d~m) <0 97xCR(upn(k)) then Reject I ,owest nelay, END.
else Continuc ! ~. WO 98/01848 PCTIGB97/0~1 This concludes the pitch estim, tion procedure of Figure 7 whose output is PMn~,. As is also illustrated in Figure 7 however, in parallel to the pitch c~tim,.tion, Process I obtains 160 sarnples centred at the middle of the Mn+l coding frarne, removes their mean value, and then calculates R0, Rl and the average RaV of the energies of the previous K non-silence coding frames. K is fixed to 50 for the first 50 non-silence coding frames, increases from 50 to 100 with the next 50 non-silence coding frarnes, and then remains constant at the value of 100.
The flowchart of the procedure that calculates R_av, R1, R0 and updates the R_av buffer is shown in Figure 16, where "Count" represents the number of non-silence speech frames, and "+1" denotes increase by one. Notice that TH is an adaptive threshold that is representative of a silence (non speech) frame and is defined as in Figure 17. CR in this case is equal to CR_max.

Given R0, R1, R_av and CR_max, the V/UV part of Process I calculates the status VM_{n+1} of the (n+1)th frame. The flowchart of this part of the algorithm is shown in Figure 18, where "V" represents the output V/UV flag of this procedure. Setting the "V" flag to 1 or 0 indicates voiced or unvoiced classification respectively. The "CR" parameter denotes the maximum value of the CR function which is calculated in the pitch estimation process. A diagrammatic representation of the voiced/unvoiced procedure is given in Figure 19.

Having the VM_{n+1} value, the PM_{n+1} estimate and the V'_n and P'_n estimates which have been produced from Process I operating on the previous nth coding frame, as illustrated in Figure 7, part b, two further locations M_{n+1}+d1 and M_{n+1}+d2 are estimated and the corresponding (−147, 147) segments of filtered speech samples are obtained as illustrated in Figure 7, part b. These additional two analysis frames are used as input to the "Pitch Estimation process" of Figure 8 to yield PM_{n+1+d1} and PM_{n+1+d2}. The procedure for calculating d1 and d2 is given in the flowchart of Figure 20.

The final step in part (a) of Process I of Figure 7 evolves the previous V/UV classification procedure of Figure 8 with inputs R0, R1, R_av, and

    CR = max[CR_{M_{n+1}}, CR_{M_{n+1}+d1}, CR_{M_{n+1}+d2}]

to yield a preliminary value V_{n+1}^{pr}.

In addition, a multipoint pitch estimation algorithm accepts PM_{n+1}, PM_{n+1+d1}, PM_{n+1+d2}, V_{n-1}, P_{n-1}, V'_n and P'_n to provide a preliminary pitch value P_{n+1}^{pr}. The flowchart of this multipoint pitch estimation algorithm is given in Figure 21, where P1, P2 and P0 represent the pitch estimates associated with the M_{n+1}+d1, M_{n+1}+d2 and M_{n+1} points respectively, and P denotes the output pitch estimate of the process, that is P_{n+1}^{pr}.

Finally, part (b) of Process I of Figure 7 imposes constraints on the V_{n+1}^{pr} and P_{n+1}^{pr} estimates in order to ensure a smooth evolution of the pitch parameter. The flowchart of this section is given in Figure 22. At the start of this process "V" and "P" represent the voicing flag and pitch estimate values before constraints are applied (V_{n+1}^{pr} and P_{n+1}^{pr} in Figure 7), whereas at the end of the process "V" and "P" represent the voicing flag and pitch estimate values after the constraints have been applied (V'_{n+1} and P'_{n+1}). The V'_{n+1} and P'_{n+1} produced from this section are then used in the next pitch post processing section together with V_{n-1}, V'_n, P_{n-1} and P'_n to yield the final voiced/unvoiced and pitch estimate parameters V_n and P_n for the nth coding frame. This pitch post processing stage is defined in the flowcharts of Figures 23, 24 and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24 being the input to Figure 25. At the start of this procedure "P_n" and "V_n" represent the pitch estimate and voicing flag respectively which correspond to the nth coding frame prior to post processing (i.e. P'_n, V'_n), whereas at the end of the procedure "P_n" and "V_n" represent the final pitch estimate and voicing flag associated with the nth frame (i.e. P_n, V_n).

The LPC analysis process (Process II of Figure 1) can be performed using the Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used, although simple autocorrelation schemes could be employed without a noticeable effect on the decoded speech quality. The LPC coefficients are then transformed to an LSP representation. Typical values for the number of coefficients are 10 to 12, and a 10th order filter has been used. LPC analysis processes are well known and described in the literature, for example "Digital Processing of Speech Signals", L.R. Rabiner and R.W. Schafer, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1978. Similarly, LSP representations are well known, for example from "Line Spectrum Pair and Speech Data Compression", F. Soong and B.H. Juang, Proc. ICASSP-84, pp 1.10.1-1.10.4, 1984. Accordingly these processes and representations will not be described further in this document. In Process II, ten LSP coefficients are used to represent the data. These 10 coefficients could be quantised using scalar quantisation with 37 bits and the bit allocation pattern [3,4,4,4,4,4,4,4,3,3]. This is a relatively simple process, but the resulting bit rate of 1850 bits/second is unnecessarily high. Alternatively the LSP coefficients can be Vector Quantised (VQ) using a Split-VQ technique. In the Split-VQ technique an LSP parameter vector of dimension "p" is split into two or more subvectors of lower dimensions and then each subvector is Vector Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used). In effect, the LSP transformed coefficient vector C, which consists of "p" consecutive coefficients (c1, c2, ..., cp), is split into "K" vectors C_k (1 ≤ k ≤ K), with the corresponding dimensions d_k (1 ≤ d_k ≤ p), p = d1 + d2 + ... + dK. In particular, when "K" is set to "p" (i.e. when C is partitioned into "p" elements) the Split-VQ becomes equivalent to Scalar Quantisation. On the other hand, when K is set to unity (K=1, d_k=p) the Split-VQ becomes equivalent to Full Search VQ.
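A minimal sketch of this Split-VQ idea, assuming precomputed subvector codebooks and a plain (unweighted) Euclidean distance for clarity:

    import numpy as np

    def split_vq(lsp, codebooks):
        """Quantise an LSP vector by splitting it into subvectors.

        lsp       : length-p LSP vector
        codebooks : list of arrays, codebooks[k] has shape (cbs_k, d_k),
                    with the d_k summing to p
        """
        indices, start = [], 0
        for cb in codebooks:
            sub = lsp[start:start + cb.shape[1]]
            # nearest codevector under the Euclidean distance
            idx = int(np.argmin(np.sum((cb - sub) ** 2, axis=1)))
            indices.append(idx)
            start += cb.shape[1]
        return indices

    # e.g. a 10-dimensional LSP vector split as 3 + 3 + 4 subvectors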


The above Split VQ approach leads to an LPC filter bit rate of the order of 1.3 to 1.4 kbits/sec. In order to further minimise the bit rate of the voice coded system described in this document, a Split Matrix VQ (SMQ) has been developed in the University of Manchester and reported in "Efficient Coding of LSP Parameters using Split Matrix Quantisation", C. Xydeas and C. Papanastasiou, Proc. ICASSP-95, pp 740-743, 1995. This method results in transparent LPC quantisation at 900 bits/sec and offers a flexible way to obtain, for a given quantisation accuracy, the required memory/complexity characteristics for Process III. An important feature of SMQ is a new weighted Euclidean distance which is defined in detail as follows.

    D(L^k(l), L̂^k(l)) = Σ_{t=0}^{N−1} Σ_{s=1}^{m(k)} [(LSP_{S(k)+s}^{l+t} − L̂SP_{S(k)+s}^{l+t})² w_g(s,t)² w_e(t)²]    (29)

where L̂^k(l) represents the kth (k=1,...,K) quantised submatrix and L̂SP_{S(k)+s}^{l+t} are its elements. m(k) represents the spectral dimension of the kth submatrix and N is the SMQ frame dimension. Note also that S(k) = Σ_{j=0}^{k−1} m(j), m(0) = 1 and Σ_{k=1}^{K} m(k) = p.

    w_e(t) = [Aver(En)]^{a1}  for transmission frames 0 ≤ t ≤ N−1, when the N LPC frames consist of both voiced and unvoiced frames
    w_e(t) = En(t)^{a}        otherwise    (30)

where Er(t) is the normalised energy of the prediction error of the (l+t)th frame, En(t) is the RMS value of the (l+t)th speech frame and Aver(En) is the average RMS value of the N LPC frames used in SMQ. The values of the constants a and a1 are set to 0.2 and 0.15 respectively.
Also:
    w_g(s,t) = |P(LSP_{S(k)+s}^{l+t})|^{β}    (31)

where P(LSP_{S(k)+s}^{l+t}) is the value of the power envelope spectrum of the (l+t)th speech frame at the LSP_{S(k)+s}^{l+t} frequency, and β is equal to 0.15.

The overall SMQ quantisation process that yields the quantised LSP coefficient vectors l^l to l^{l+N−1} for the l to l+N−1 analysis frames is shown in Figure 26. This figure also includes the inverse process, which accepts the above l^{l+i} vectors, i=0,...,N−1, and provides the corresponding LPC coefficient vectors â^l to â^{l+N−1}. The a^{l+i}, i=0,...,N−1, coefficient vectors are modified, prior to the LPC to LSP transformation, by a 10 Hz bandwidth expansion as indicated in Figure 26. A 5 Hz bandwidth expansion is also included in the inverse quantisation process.
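The bandwidth expansion step can be realised in the usual way by scaling each LPC coefficient geometrically; the mapping from a bandwidth in Hz to the scaling constant γ below is the standard one and is an assumption here, since the patent only states the expansion amounts:

    import numpy as np

    def bandwidth_expand(a, bw_hz, fs=8000.0):
        """Scale LPC coefficients a_i by gamma^i, gamma = exp(-pi*bw/fs),
        which widens every pole bandwidth by roughly bw_hz."""
        gamma = np.exp(-np.pi * bw_hz / fs)
        return np.asarray(a) * gamma ** np.arange(1, len(a) + 1)

    # e.g. a 10 Hz expansion prior to the LPC-to-LSP transformation:
    # a_mod = bandwidth_expand(a_hat, 10.0)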

Process IV of Figure 1 will now be described. This process is concerned with the mixed voiced classification of harmonics. When the nth coding frame is classified as voiced, the residual signal R_n(i) of length 160 samples centred at the middle M_n of the nth coding frame and the pitch period P_n for that frame are used to determine the strongly voiced (hv_j=1)/weakly voiced (hv_j=0) classification associated with the jth harmonic ω_j^n. The flowchart of Process IV is given in Figure 27. The R_n array of 160 samples is Hamming windowed and augmented to form a 512 size array, which is then FFT processed. The maximum and minimum values MGR_max, MGR_min of the resulting 256 spectral magnitude values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the magnitude spectrum. The clipped MGR array is searched to define peaks MGR(P) satisfying:

    MGR(P) > MGR(P+1) and MGR(P) > MGR(P−1)

For each peak MGR(P), "supported" by the MGR(P+1) and MGR(P−1) values, a second order polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In particular peaks are rejected:

a) if there are spectral peaks in the neighbourhood of loc(MGR(P)) (i.e. in the range loc(MGR(P))−fo/2 to loc(MGR(P))+fo/2, where fo is the fundamental frequency in Hz) whose value is larger than 80% of MGR(P), or
b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P).

After applying these two constraints the remaining spectral peaks are characterised as "dominant" peaks. The objective of the remaining part of the process is to examine whether there is a "dominant" peak near a given harmonic j×fo, in which case the harmonic is classified as strongly voiced and hv_j=1, otherwise hv_j=0. In particular, two thresholds are defined as follows:
    TH1 = 0.15 × fo,  TH2 = (1.5/P_n) × fo

with fo = (1/P_n) × f_s, where f_s is the sampling frequency.

The difference loc(MGR_d(k)) − loc(MGR_d(k−1)) is compared to 1.5×fo + TH2, and if larger, a related harmonic is not associated with a "dominant" peak and the corresponding classification hv is zero (weakly voiced). loc(MGR_d(k)) is the location of the kth dominant peak, with k=1,...,D, where D is the number of dominant peaks. This procedure is described in detail in Figure 28, in which it should be noted that the harmonic index j does not always correspond to the magnitude spectrum peak index k, and loc(k) is the location of the kth dominant peak, i.e. loc(MGR_d(k)) = loc(k). In order to minimise the bit rate associated with the transmission of the hv_j information, two schemes have been employed which coarsely represent hv.

Scheme I

The spectrum is divided into bands of 500 Hz each and a strongly voiced/weakly voiced flag Bhv is assigned for each band. The first and last 500 Hz bands, i.e. 0 to 500 and 3500 to 4000 Hz, are always regarded as strongly voiced (Bhv=1) and weakly voiced (Bhv=0) respectively. When V_n=1 and V_{n-1}=1 the 500 to 1000 Hz band is classified as voiced, i.e. Bhv=1. Furthermore, when V_n=1 and V_{n+1}=0 the 3000 to 3500 Hz band is classified as weakly voiced, i.e. Bhv=0. The Bhv values of the remaining 5 bands are determined using a majority decision rule on the hv_j values of the j harmonics which fall within the band under consideration. When the number of harmonics for a given band is even and no clear majority can be established, i.e. the number of harmonics with hv_j=1 is equal to the number of harmonics with hv_j=0, then the value of Bhv for that band is set to the opposite of the value assigned to the immediately preceding band. At the decoding process the hv_j of a specific harmonic j is equal to the Bhv value of the corresponding band. Thus the hv information may be transmitted with 5 bits.
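A sketch of this majority decision rule for the interior bands; the band edges used here and the helper names are illustrative assumptions, and the fixed first/last band handling is as described above:

    def band_flags(hv, fo_hz, band_edges=(1000, 1500, 2000, 2500, 3000)):
        """Majority-vote Bhv flags for 500 Hz bands (Scheme I interior bands).

        hv    : per-harmonic strongly(1)/weakly(0) voiced flags,
                hv[j] belonging to harmonic frequency (j+1)*fo_hz
        fo_hz : fundamental frequency in Hz
        """
        flags, prev = [], 1                  # the band below is strongly voiced
        for lo in band_edges:
            votes = [hv[j] for j in range(len(hv))
                     if lo <= (j + 1) * fo_hz < lo + 500]
            ones = sum(votes)
            if 2 * ones == len(votes):       # tie: opposite of preceding band
                b = 1 - prev
            else:
                b = 1 if 2 * ones > len(votes) else 0
            flags.append(b)
            prev = b
        return flags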

Scheme II

In this case the 680 Hz to 3400 Hz range is represented by only two variable size bands. When V_n=1 and V_{n-1}=0 the Fc frequency that separates these two bands can be one of the following:

(A) 680, 1360, 2040, 2720

whereas, when V_n=1 and V_{n-1}=1, Fc can be one of the following frequencies:

(B) 1360, 2040, 2720, 3400.

Furthermore, the 0 to 680 and 3400 to 4000 Hz bands are always represented with Bhv=1 and Bhv=0 respectively. The Fc frequency is selected by examining the three bands sequentially defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics which fall within a band. When a band with a mixed voiced classification Bhv=0 is found, i.e. the number of harmonics with hv_j=0 is larger than the number of harmonics with hv_j=1, then Fc is set to the lower boundary of this band and the remaining spectral region is classified as Bhv=0. In this case only 2 bits are allocated to define Fc. The lower band is strongly voiced with Bhv=1, whereas the higher band is weakly voiced with Bhv=0.

To illustrate the effect of the mixed voice classification on the speech synthesised from the transmitted information, Figures 29 and 30 represent respectively an original speech waveform obtained for the utterance "Industrial shares were mostly a" and frequency tracks obtained for that utterance. The horizontal axis represents time in terms of frames each of 20 msec duration. Figure 31 shows to a larger scale a section of Figure 30, and represents frequency tracks by full lines for the case when the voiced frames are all deemed to be strongly voiced (hv=1) and by dashed lines when the strongly/weakly voiced classification is taken into account so as to introduce random perturbations when hv=0.

Figure 32 shows four waveforms A, B, C and D. Waveform A represents the magnitude spectrum of a speech segment and the corresponding LPC spectral envelope (log10 domain). Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the corresponding residual segment (B), the excitation segment obtained using the binary (voiced/unvoiced) excitation model (C), and the excitation segment obtained using the strongly voiced/weakly voiced/unvoiced hybrid excitation model (D). It will be noted that the hybrid model introduces an appropriate amount of randomness where required in the 3π/4 to π range, such that curve D is a much closer approximation to curve B than is curve C.

Process V of Figure 1 will now be described. Once the residual signal has been derived, a segment of P_n samples is obtained in the residual signal domain. The magnitude spectrum of the segment, which contains excitation source information, is derived by applying a P_n point DFT. An alternative solution, in order to avoid the computational complexity of the P_n point DFT, is to apply a fixed length FFT (128 points) and to find the value of the magnitude spectrum at the desired points using linear interpolation.

For a real-valued sequence x(i) of P points, the DFT may be expressed as:

    X(k) = Σ_{i=0}^{P−1} x(i) cos(2πki/P) − j Σ_{i=0}^{P−1} x(i) sin(2πki/P)

The P_n point DFT will yield a double-side spectrum. Thus, in order to represent the excitation signal as a superposition of sinusoidal signals, the magnitude of all the non DC components must be multiplied by a factor of 2. The total number of single side magnitude spectrum values, which are used in the reconstruction process, is equal to ⌈(P + 1)/2⌉.
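A sketch of this magnitude extraction for one pitch segment; numpy's FFT is used in place of the direct DFT sum, and the single-side scaling follows the text:

    import numpy as np

    def pitch_segment_magnitudes(residual_segment):
        """Single-side magnitude spectrum MG_j of a P_n-sample pitch segment."""
        P = len(residual_segment)
        X = np.fft.fft(residual_segment)        # P-point double-side spectrum
        n_mag = (P + 2) // 2                    # ceil((P+1)/2) values
        MG = np.abs(X[:n_mag]) / P
        MG[1:] *= 2.0                           # non-DC components doubled
        MG[0] = 0.0                             # DC contribution neglected (Process VI)
        return MG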

Process VI of Figure 1 will now be described. The DFT (Process V) applied on the P_n samples of a pitch segment in the residual domain yields ⌈(P_n + 1)/2⌉ spectral magnitudes (MG_j^n, 0 ≤ j < ⌈(P_n + 1)/2⌉) and ⌈(P_n + 1)/2⌉ phase values. The phase information is neglected. However, the continuity of the phase between adjacent voiced frames is preserved. Moreover, the contribution of the DC magnitude component is assumed to be negligible and thus MG_0 is set to 0. In this way, the non-DC magnitude spectrum is assumed to contain all the perceptually important information. Based on the assumption of an "approximately" flat shape magnitude spectrum for the pitch residual segment, various methods could be used to represent the entire magnitude spectrum with a single value. Specifically, a modified single value spectral amplitude representation (MSVSAR) technique is described below. MSVSAR is based on the observation that some of the speech spectrum resonance and anti-resonance information is also present in the residual magnitude spectrum (G.S. Kang and S.S. Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction Vocoder", IEEE Trans. Acoust., Speech and Signal Proc., Vol. ASSP-33, pp.377-386, 1985).
LPC inverse filtering cannot produce a residual signal of absolutely flat magnitude spectrum, mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which results in the magnitudes of the resonant peaks being dependent upon the pole locations of the 1/A(z) all-pole filter, and b) the LPC quantisation noise. As a consequence, the LPC residual signal is itself highly intelligible. Based on this observation the MG_j magnitudes are obtained by spectral sampling, at the harmonic locations ω_j^n, 0 ≤ j < ⌈(P_n + 1)/2⌉, of a modified LPC synthesis filter that is defined as follows:

    MP(z) = GN / (1 − GR Σ_{i=1}^{p} â_i^n z^{-i})    (32)

where â_i^n, i=1,...,p, represent the p quantised LPC coefficients of the nth coding frame and GR and GN are defined as follows:

    GR = ḠR × Π_{i=1}^{p} (1 − (K_i^n)²)    (33)

and

    GN = sqrt( ( (1/2) Σ_{i=0}^{2P_n−1} (x_n^{2P}(i))² ) / ( Σ_{j=1}^{⌈(P_n+1)/2⌉−1} (MP(ω_j^n) H(ω_j^n))² / 2 ) )    (34)

where K_i^n, i=1,...,p, are the reflection coefficients of the nth coding frame, x_n^{2P}(i) represents a sequence of 2P_n speech samples centred in the middle of the nth coding frame, from which the mean value is calculated and removed, and MP(ω_j^n) and H(ω_j^n) represent the frequency responses of the MP(z) and 1/A(z) filters respectively at the ω_j^n frequency. Notice that the MP(ω_j^n) values are calculated assuming GN=1. The ḠR parameter represents a constant whose value is set to 0.25.

Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose frequency response consists of nearly equalised resonant peaks, the locations of which are very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback gain GR is controlled by the performance of the LPC model (i.e. it is proportional to the normalised LPC prediction error). In addition, Equation 34 ensures that the energy of the reproduced speech signal is equal to the energy of the original speech waveform. Robustness is increased by computing the speech RMS value over two pitch periods.

Two alternative magnitude spectrum representation techniques are described below, which allow for better coding of the magnitude information and lead to a significant improvement in reconstructed speech quality.

The first of the alternative magnitude spectrum representation techniques is referred to below as the "Na amplitude system". The basic principle of this MG_j^n quantisation system is to represent accurately those MG_j^n values which correspond to the Na largest speech Short Term (ST) spectral envelope values. In particular, given the LPC coefficients of the nth coding frame, the ST magnitude spectrum envelope is calculated (i.e. sampled) at the harmonic frequencies ω_j^n, and the locations lc(j), j=1,...,Na, of the largest Na spectral samples are determined. These locations indicate effectively which of the ⌈(P_n + 1)/2⌉ − 1 MG_j^n magnitudes are subjectively more important for accurate quantisation. The system subsequently selects MG_j^n, j=lc(1),...,lc(Na), and Vector Quantises these values. If the minimum pitch value is 17 samples, the number of non-DC MG_j^n amplitudes is equal to 8 and for this reason Na ≤ 8. Two variations of the "Na-amplitudes system" were developed with equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b) respectively.

i) Na-amplitudes system with Mean Normalisation Factor. In this variation, a pitch segment of P_n residual samples R_n(i), centred about the middle M_n of the nth coding frame, is obtained and DFT transformed. The mean value of the spectral magnitudes MG_j^n, j=1,...,⌈(P_n + 1)/2⌉ − 1, is calculated as:

    m = ( Σ_{j=1}^{⌈(P_n+1)/2⌉−1} MG_j^n ) / ( ⌈(P_n + 1)/2⌉ − 1 )    (35)
~ 1-~ MGjn m= j~~ (35) P" +l -I

m is quantised and then used as the normalisation factor of the Na selected amplitudes MG_j^n, j=lc(1),...,lc(Na). The resulting Na amplitudes are then vector quantised to M̂G_j^n.

ii) Na-amplitudes system with RMS Normalisation Factor. In this variation the RMS value of the pitch segment centred about the middle M_n of the nth coding frame is calculated as:

    g = sqrt( (1/P_n) Σ_{i=0}^{P_n−1} R_n(i)² )    (36)

g is quantised and then used as the normalisation factor of the Na selected amplitudes MG_j^n, j=lc(1),...,lc(Na). These normalised amplitudes are then Vector Quantised to M̂G_j^n. Notice that the P_n point DFT operation can be avoided in this case, since the magnitude spectrum of the pitch segment is calculated only at the Na selected harmonic frequencies ω_j^n, j=lc(1),...,lc(Na).

In both cases the quantisation of the m and g factors, used to normalise the MG_j^n values, is performed using an adaptive μ-law quantiser with the non-linear characteristic:

    Q(A) = (log(1 + μ|A|/A_max) / log(1 + μ)) × sgn(A),  with μ = 255    (37)

This arrangement for the quantisation of g or m extends the dynamic range of the coder to not less than 25 dBs.
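A sketch of this μ-law companding characteristic (Equation 37); the adaptive A_max tracking shown in Figure 34 is reduced here to a fixed parameter for clarity:

    import numpy as np

    def mu_law_compress(a, a_max, mu=255.0):
        """Eq. 37: non-linear characteristic used to quantise g or m."""
        a = np.asarray(a, dtype=float)
        return np.sign(a) * np.log1p(mu * np.abs(a) / a_max) / np.log1p(mu)

    def mu_law_expand(y, a_max, mu=255.0):
        """Inverse characteristic, as needed at the decoder."""
        y = np.asarray(y, dtype=float)
        return np.sign(y) * (a_max / mu) * np.expm1(np.abs(y) * np.log1p(mu))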

At the receiver end the decoder recovers the MG_j^n magnitudes as MG_j^n = M̂G_j^n × Â, j=lc(1),...,lc(Na). The remaining ⌈(P_n + 1)/2⌉ − Na − 1 MG_j^n values are set to the constant value Â (where A is either "m" or "g"). The block diagram of the adaptive μ-law quantiser is shown in Figure 34.

The second of the alternative magnitude spectrum representation techniques is referred to below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding systems which employ the general synthesis formula of Equation (1) to recover speech encounter the problem of coding a variable length, pitch dependent spectral amplitude vector MG. The "Na-amplitudes" MG_j^n quantisation schemes described in Figure 33 avoid this problem by Vector Quantising the minimum expected number of spectral amplitudes and by setting the rest of the MG_j^n amplitudes to a fixed value. However, such a partially spectrally flat excitation model has limitations in providing high recovered speech quality. Thus, in order to improve the output speech quality, the shape of the entire {MG_j^n} magnitude spectrum should be quantised. Various techniques have been proposed for coding {MG_j^n}.
Originally ADPCM was used across the MG_j^n values associated with a specific coding frame. Also {MG_j^n} has been DCT transformed and coded differentially across successive MG_j^n magnitude spectra. However, these coding schemes are rather inefficient and operate at relatively high bit rates. The introduction of Vector Quantisation on the {MG_j^n} spectral amplitude vectors allowed for the development of Sinusoidal and Prototype Interpolation systems which operate at around 2.4 kbits/sec. Two known {MG_j^n} VQ methods are described below which quantise a variable size (vsn) input vector with a fixed size (fxs) codevector.

i) The first VQ method involves the transformation of the input vector to a fixed size vector followed by conventional Vector Quantisation. The inverse transformation on the quantised fixed size vector yields the recovered quantised MG^n vector. Transformation techniques which have been used include Linear Interpolation, Band Limited Interpolation, All Pole modelling and Non-Square transformation. However, the overall distortion produced by this approach is the summation of the VQ noise and a component which is introduced by the transformation process.

ii) The second VQ method achieves the direct quantisation of a variable size input vector with a fixed size code vector. This is based on selecting only vsn elements from each codebook vector to form a distortion measure between a codebook vector and an input MG^n vector. Such a quantisation approach avoids the transformation distortion of the previous techniques mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation noise.

An improved VQ method will now be described, which is referred to below as the Variable Size Spectral Vector Quantisation (VS/SVQ) scheme. This scheme was developed to take advantage of the underlying principle that the actual shape of the {MG_j^n} magnitude spectrum is defined by a minimum of ⌈(P_n + 1)/2⌉ equally spaced samples. If we consider the maximum expected pitch estimate P_max, then any {MG_j^n} spectral shape can be represented adequately by ⌈(P_max + 1)/2⌉ samples. This suggests that the fixed size fxs of the codebook vectors S^i representing the MG_j^n shapes should not be larger than ⌈(P_max + 1)/2⌉. Of course this also implies that, given the ⌈(P_max + 1)/2⌉ samples of a codebook vector, the complete spectral shape, defined at any frequency, is obtained via an interpolation process.
Figure 35 hi~hli~ht~ the VS/SVQ process. The codeboolc CBS having cbs fixed fxs ~lim~n~ion vectors SJ, j=l,...,fxs and i=l,...,cbs, where fxs isr(P" ~1) / 21, is used to quantise an input vector MG,, j=l,...,vsn of dimension vsn. Interpolation (in this case linear) is used on the Si vectors to yield S" vectors of fiin~erl~ion vsn. The Si to S" interpolation process is given by:
s ~ s (li~s) +(i~s ~ SD X si(~ sS ) -S ( iv~s; ) (38) vs" vs" vs" ,~s _ , lScs vs" v*"
for i=l,...,cbs and j=l,...,vsn _ , W O98tO1848 PCT/GB97/Ol~l This process effectively defines S'' spectral shapes at the c)', frequencies of the MG;' vector. A distortion measure D(S'',MG") is then deflned between the S" and MG"
vectors, and the codebook vector Sl that yields the minimum distortion is selected and its index I is transmitted. Of course in the receiver, Equation (38) is used to define MG" from S
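A sketch of this variable-size search; the linear interpolation follows the spirit of Equation (38), though the exact index mapping is an assumption since the original formula is only partially legible:

    import numpy as np

    def vs_svq(mg, codebook, w=None):
        """Variable Size Spectral VQ: match a vsn-point vector against
        fxs-point codevectors via linear interpolation (Eq. 38)."""
        vsn = len(mg)
        fxs = codebook.shape[1]
        # positions of the vsn input bins on the fxs-point codevector grid
        pos = np.linspace(0, fxs - 1, vsn)
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, fxs - 1)
        frac = pos - lo
        w = np.ones(vsn) if w is None else w
        best_i, best_d = -1, np.inf
        for i, s in enumerate(codebook):
            s_interp = s[lo] + frac * (s[hi] - s[lo])     # S'^i at the vsn bins
            d = float(np.sum(w * (mg - s_interp) ** 2))   # weighted MSE
            if d < best_d:
                best_i, best_d = i, d
        return best_i, best_d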

If we assume that P_max = 120 then fxs = 60. However this value can be reduced to 50 without significant degradation by low pass filtering the signal synthesised from Equation (1). This is achieved by setting to zero all the harmonics MG_j^n in the region of 3.4 to 4.0 kHz, in which case:

    vsn = ⌈3400 × P_n / f_s⌉  if this value is ≤ 50
    vsn = 50                   otherwise    (39)

and vsn ≤ fxs.

Amplitude vectors obtained from adjacent residual frames exhibit significant redundancy, which can be removed by means of backward prediction. Prediction is performed on a harmonic basis, i.e. the amplitude value of each harmonic MG_j^n is predicted from the amplitude value of the same harmonic in previous frames, i.e. MG_j^{n-1}. A fixed linear predictor M̃G_j^n = b × M̂G_j^{n-1} may be incorporated in the VS/SVQ system, and the resulting DPCM scheme is shown in Figure 36 (differential VS/SVQ, DVS/SVQ). In particular, error vectors are formed as the difference between the original spectral amplitudes MG_j^n and their predicted ones M̃G_j^n, i.e.:

    E_j^n = MG_j^n − M̃G_j^n  for 1 ≤ j ≤ vsn

where the predicted spectral amplitudes M̃G_j^n are given as:

    M̃G_j^n = b × M̂G_j^{n-1}  when V_{n-1} = 1
    M̃G_j^n = 0                when V_{n-1} = 0
    for 1 ≤ j ≤ vsn_{n-1}    (40)

and

    M̃G_j^n = (1/vsn_{n-1}) Σ_{i=1}^{vsn_{n-1}} M̂G_i^{n-1}  for vsn_{n-1} < j ≤ vsn    (41)

Furthermore the quantised spectral amplitudes M̂G_j^n are given as:

    M̂G_j^n = M̃G_j^n + Ê_j^n                for 1 ≤ j ≤ vsn
    M̂G_j^n = (1/vsn) Σ_{i=1}^{vsn} M̂G_i^n  for vsn < j < ⌈(P_n + 1)/2⌉    (42)

where Ê_j^n denotes the quantised error vector.

The quantisation of the E_j^n, 1 ≤ j ≤ vsn, error vector incorporates Mean Removal and Gain Shape Quantisation techniques, using the hierarchical VQ structure of Figure 36.

A weighted Mean Square Error is used in the VS/SVQ stage of the system. The weighting function is defined as the frequency response of the filter W(z) = 1/A_n(z/γ), where A_n(z) is the short-term linear prediction filter and γ is a constant, defined as γ=0.93. Such a weighting function, which is proportional to the short-term envelope spectrum, results in substantially improved decoded speech quality. The weighting function W_j^n is normalised so that:

    Σ_{j=1}^{vsn} W_j^n = 1    (43)

The pdf of the mean value of E^n is very broad and, as a result, the mean value differs widely from one vector to another. This mean value can be regarded as statistically independent of the variation of the shape of the error vector E^n and thus can be quantised separately without paying a substantial penalty in compression efficiency. The mean value of an error vector is calculated as follows:

    M = Σ_{j=1}^{vsn} W_j^n × E_j^n    (44)

M is Optimum Scalar Quantised to M̂ and is then removed from the original error vector to form Erm_j^n = E_j^n − M̂. The overall quantisation distortion is attributed to the quantisation of the "Mean Removed" error vectors (Erm^n), which is performed by a Gain-Shape Vector Quantiser.

The objective of the Gain-Shape VQ process is to determine the gain value G and the shape vector S so as to minimise the distortion measure:

    D(Erm^n, G × S) = Σ_{j=1}^{vsn} W_j^n [Erm_j^n − G × S'_j]²    (45)

A gain optimised VQ search method, similar to techniques used in CELP systems, is employed to find the optimum G and S. The shape codebook (CBS) of vectors S^i is searched first to yield an index I which maximises the quantity:

    Q(i) = ( Σ_{j=1}^{vsn} W_j^n Erm_j^n S'^i_j )² / ( Σ_{j=1}^{vsn} W_j^n (S'^i_j)² ),  for i=1,...,cbs    (46)

where cbs is the number of codevectors in the CBS. The optimum gain value is defined as:

    G = ( Σ_{j=1}^{vsn} W_j^n Erm_j^n S'^I_j ) / ( Σ_{j=1}^{vsn} W_j^n (S'^I_j)² )    (47)

and is Optimum Scalar Quantised to Ĝ.
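A sketch of this gain-optimised shape search (Equations 46-47), reusing the interpolation idea from the VS/SVQ sketch above; the names are illustrative:

    import numpy as np

    def gain_shape_search(erm, shapes_interp, w):
        """Search interpolated shape codevectors S'^i for the index that
        maximises Q(i) (Eq. 46), then compute the optimum gain (Eq. 47).

        erm           : mean-removed error vector (vsn values)
        shapes_interp : array (cbs, vsn) of interpolated codevectors S'^i
        w             : normalised weighting function W_j^n
        """
        num = shapes_interp @ (w * erm)          # sum_j W_j Erm_j S'_j
        den = (shapes_interp ** 2) @ w           # sum_j W_j S'_j^2
        q = num ** 2 / den                       # Eq. 46
        best = int(np.argmax(q))
        gain = num[best] / den[best]             # Eq. 47
        return best, gain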

During shape quantisation the principles of VS/SVQ are employed, in the sense that the S'^i vsn-size vectors are produced using Linear Interpolation on fxs-size codevectors S^i. Both trained and randomly generated shape CBS codebooks were investigated. Although Erm^n has noise-like characteristics, systems using randomly generated shape codebooks resulted in unsatisfactory muffled decoded speech and were inferior to systems employing trained shape codebooks.

A closed-loop joint predictor and VQ design process was employed to design the CBS codebook, the optimum scalar quantisers CBM and CBG of the mean M and gain G values respectively, and also to define the prediction coefficient b of Figure 36. In particular, the following steps take place in the design process.

STEP A0 (k=0). Given a training sequence of MG_j^n, the predictor b_0 is calculated in an open loop fashion (i.e. M̃G_j^n = b × MG_j^{n-1} for 1 ≤ j < ⌈(P_n + 1)/2⌉ when V_{n-1}=1, or M̃G_j^n = 0 elsewhere). Furthermore, the CBM0 mean, CBG0 gain and CBS0 shape codebooks are designed independently, and again in an open loop fashion, using unquantised E^n. In particular:

a) Given a training sequence of error vectors E^{n,0}, the mean value of each E^{n,0} is calculated and used in the training process of an Optimum Scalar Quantiser (CBM0).
b) Given a training sequence of error vectors E^{n,0} and the CBM0 mean quantiser, the mean value of each error vector is calculated, quantised using the CBM0 quantiser and removed from the original error vectors E^{n,0} to yield a sequence of "Mean Removed" training vectors Erm^{n,0}.
c) Given a training sequence of Erm^{n,0} vectors, each "Mean Removed" training vector is normalised to unit power (i.e. is divided by the factor G = sqrt(Σ_j W_j^n (Erm_j^{n,0})²)), linearly interpolated to fxs points, and then used in the training process of a conventional Vector Quantiser of fxs dimensions (CBS0).
d) Given a training sequence of Erm^{n,0} vectors and the CBS0 shape codebook, each "Mean Removed" training vector is encoded using Equations 46 and 47, and the value G of Equation 47 is used in the training process of an Optimum Scalar Quantiser (CBG0).

k is set to 1 (k=1).

STEP A1. Given a training sequence of MG_j^n and the mean, gain and shape codebooks of the previous k−1 iteration (i.e. CBMk−1, CBGk−1, CBSk−1), the optimum prediction coefficient b_k is calculated.

STEP A2. Given a training sequence of MG_j^n, an optimum prediction coefficient b_k and CBMk−1, CBGk−1, CBSk−1, a training sequence of error vectors E^{n,k} is formed, which is then used for the design of new mean, gain and shape codebooks (i.e. CBMk, CBGk, CBSk).

STEP A3. The performance of the kth iteration quantisation system (i.e. b_k, CBMk, CBGk, CBSk) is evaluated and compared against the quantisation system of the previous iteration (i.e. b_{k−1}, CBMk−1, CBGk−1, CBSk−1). If the quantisation distortion converges to a minimum, the quantisation design process stops. Otherwise, k=k+1 and steps A1, A2 and A3 are repeated.

The performance of each quantiser (i.e. b_k, CBMk, CBGk, CBSk) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.

The design for the Mean-Shape-Gain Quantiser used in STEP A2 is performed using the following two steps:

STEP B1. Given a training sequence of error vectors E^{n,k}, the mean value of each E^{n,k} is calculated and used in the training process of an Optimum Scalar Quantiser (CBMk).

STEP B2. Given a training sequence of error vectors E^{n,k} and the CBMk mean quantiser, the mean value of each residual vector is calculated, quantised and removed from the original residual vectors E^{n,k} to yield a sequence of "Mean Removed" training vectors Erm^{n,k}, which are then used as the training data in the design of an optimum Gain Shape Quantiser (CBGk and CBSk). This involves steps C1 - C4 below. (The quantisation design process is performed under the assumption of an independent gain shape quantiser structure, i.e. an input error vector Erm^n can be represented by any possible combination of Ŝ codebook shape vectors and Ĝ gain quantiser levels.)

STEP C1 (v=0). Given a training sequence of vectors Erm^{n,k} and initial CBGk,0 and CBSk,0 gain and shape codebooks respectively, compute the overall average distortion distance D_{k,0} as in Equation (45). Set v equal to 1 (v=1).
STEP C2. Given a training sequence of vectors Erm^{n,k} and the CBGk,v−1 gain codebook from the previous iteration, compute the new shape codebook CBSk,v which minimises the VQ distortion measure. Notice that the optimum CBSk,v shape codebook is obtained when the distortion measure of Equation (45) is a minimum, and this is achieved in M1_{k,v} iterations.

STEP C3. Given a training sequence of vectors Erm^{n,k} and the CBSk,v shape codebook, compute a new gain quantiser CBGk,v which minimises the distortion measure of Equation (45). This optimum CBGk,v gain quantiser is obtained when the distortion measure of Equation (45) is a minimum, and this is achieved in M2_{k,v} iterations.

STEP C4. Given a training sequence of vectors Erm^{n,k} and the shape and gain codebooks CBSk,v and CBGk,v, compute the average overall distortion measure. If (D_{k,v−1} − D_{k,v})/D_{k,v} < δ, stop. Otherwise, v=v+1 and go back to STEP C2.

The centroids S^{k,v,m}_{i,u}, i=1,...,cbs and u=1,...,fxs, of the shape codebook CBSk,v,m are updated during the mth iteration performed in STEP C2 (m=1,...,M1_{k,v}) as follows:

    S^{k,v,m}_{i,u} = ( Σ_{Erm^{n,k} ∈ Q_i} (NC_{u,j,n} + C_{u,j,n} NC_{u',j,n}) ) / ( Σ_{Erm^{n,k} ∈ Q_i} (DC_{u,j,n} + C_{u,j,n} DC_{u',j,n}) )    (48)

where

    DC_{u,j,n} = W_j^n (G^{k,v−1}_{Jn} × f_{u,j,n})²

    NC_{u,j,n} = W_j^n G^{k,v−1}_{Jn} f_{u,j,n} (Erm_j^n − G^{k,v−1}_{Jn} S^{k,v,m−1}_{i,u'} f_{u',j,n})

    f_{u,j,n} = 1 − |j × fxs/vsn − u|  if |j × fxs/vsn − u| < 1
    f_{u,j,n} = 0                       if |j × fxs/vsn − u| ≥ 1

    C_{u,j,n} = 1  if 1 ≤ |j × fxs/vsn − u| + 1 ≤ fxs
    C_{u,j,n} = 0  otherwise

    u'(u,j,n) = u − 1  if u > j × fxs/vsn
    u'(u,j,n) = u + 1  if u < j × fxs/vsn
Q, denotes the cluster of Ermn k error vectors which are quantised to the S~ r-m-~ codebook shape vector, cbs le~esell~ the total n unber of shape ql1~nti~tion levels, Jn represents the CBGk-V-I gain codebook index which encodes the Erm'l ~ error vector auld ISjsvs".

The gain centroids, Gl Vm, i=l,...,cbg of the CBG~:-V-~n gain quantiser, which are computed during the mth iteration in STE~P C3 (m=l,...,M2k v), are given as:

CA 02259374 l998-l2-29 P~: l 1~97/0 1~3 1 ' (~ Erm~S~ ~ W~) G' ~ ~ ,t E,~ .D, i 1 ~ (49) (~(S,~ Y) W,.~) where Dj denotes the cluster of Ermn k error vectors which are qtl~nticed to the Gk-V-n'-' gain quantiser level, cbg ~lest;llts the total number of gaLTl qu~ntis~tion levels, In le~senl~ the C~3Sk-Y shape codebook index which encodes the Errnn ~ error vector and l<j<vsn.

The above employed design process is applied to obtain the optimum shape codebook CBS, optirnurn gain and mean qu:~n~i7ers7 CBG and C~BM and the optimum prediction coe~lcient b which was fin~lly set to b=0 35.

Process VII calculates the energy of the residual signal. The LPC analysis performed in Process II provides the prediction coefficients a; I<i<p and the reflection coefficients k p. On the other hand, the Voiced/Unvoiced classification perforrned in Process Iprovides the short term autocorrelation coeff1cient for zero delay of the speech signal (R0) for the frame under consideration. Hence~ the Energy of the residual signal E" value is given as:

En = MRO~ K,)2 (50) The above e~ ion represents the minimum prediction error as it is obtained from the I,inear Prediction process. However, because of qll~nliY~tion distortion the pararneters of the l,PC filter used in the coding-decoding process are slightly different from the ones that achieve minim-lm prediction error. Thus, Equation (50) gives a good apl)lo~ ation of the residual si~nal energy v~ith low computational requirements The accurate En value can be given as:

PCT/GB97/018~1 ' '. WO 98/01848 M-l E" = M~R (i) (51) The res--ltin~ is then Scalar Quantised USillg an adaptive ~-law qll~ntice~l arrangement similar to the one depicted in Figure 34 In the case where more than one ~/~ are used in the system i.e. the energy En is calculated for a number of subframes then ~" ~ is given by the general equation:
Ms -I
E"~, = M ~ R ~i + ~tM.r ~ O S ~ 5 ~ (52) Notice that when - = L Ms-M and for ~ = 4, M~=M/4.

Claims (47)

1. A speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including a voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered speech segment centred about a reference sample is defined in each frame, a correlation value is calculated for each of a series of candidate pitch estimates as the maximum of multiple crosscorrelation values obtained from variable length speech segments centred about the reference sample, the correlation values are used to form a correlation function defining peaks, and the locations of the peaks are determined and used to define a pitch estimate.
2. A system according to claim 1, wherein the pitch estimate is defined using an iterative process.
3 A system according to claim 1 or 2, wherein a single reference sample may be used, centred with respect to the respective frame.
4. A system according to claim 1 or 2, wherein multiple pitch estimates are derived for each frame using different reference samples, the multiple pitch estimates being combined to define a combined pitch estimate for the frame.
5. A system according to any preceding claim, wherein the pitch estimate is modified by reference to a voiced/unvoiced status and/or pitch estimates of adjacent frames to define a final pitch estimate.
6. A system according to any preceding claim, wherein the correlation function is clipped using a threshold value, remaining peaks being rejected if they are adjacent to larger peaks.
7. A system according to claim 6, wherein peaks are selected which are larger than either adjacent peak and peaks are rejected if they are smaller than a following peak by more than a predetermined factor.
8. A system according to any preceding claim, wherein the pitch estimation procedure is based on a least squares error algorithm.
9. A system according to claim 8, wherein the pitch estimation algorithm defines the pitch value as a number whose multiples best fit the correlation function peak locations.
10. A system according to any preceding claim, wherein possible pitch values are limited to integral numbers which are not consecutive, the increment between two successive numbers being proportional to a constant multiplied by the lower of those two numbers.
11. A speech synthesis system in which a speech signal is divided into a series of frames, and each frame is converted into a coded signal including pitch segment magnitude spectral information, a voiced/unvoiced classification, and a mixed voiced classification which classifies harmonics in the magnitude spectrum of voiced frames as strongly voiced or weakly voiced, wherein a series of samples centred on the middle of the frame are windowed to form a data array which is Fourier transformed to produce a magnitude spectrum, a threshold value is calculated and used to clip the magnitude spectrum, the clipped data is searched to define peaks, the locations of peaks are determined, constraints are applied to define dominant peaks, and harmonics not associated with a dominant peak are classified as weakly voiced.
12. A system according to claim 11, wherein peaks are located using a second order polynomial.
13. A system according to claim 11 or 12, wherein the samples are Hamming windowed.
14. A system according to claim 11, 12 or 13, wherein the threshold value is calculated by identifying the maximum and minimum magnitude spectrum values and defining the threshold as a constant multiplied by the difference between the maximum and minimum values.
15. A system according to any one of claims 11 to 14, wherein peaks are defined as those values which are greater than the two adjacent values, a peak being rejected from consideration if neighbouring peaks are of a similar magnitude or if there are spectral magnitudes in the same range of greater magnitude.
16. A system according to any one of claims 11 to 15, wherein a harmonic is considered as not being associated with a dominant peak if the difference between two adjacent peaks is greater than a predetermined threshold value.
17. A system according to any one of claims 11 to 16, wherein the spectrum is divided into bands of fixed width and a strongly/weakly voiced classification is assigned for each band.
18. A system according to any one of claims 11 to 17, wherein the frequency range is divided into two or more bands of variable width, adjacent bands being separated at a frequency selected by reference to the strongly/weakly voiced classification of harmonics.
19. A system according to claim 17 or 18, wherein the lowest frequency band is regarded as strongly voiced, whereas the highest frequency band is regarded as weakly voiced.
20. A system according to claim 19, wherein, in the event that a current frame is voiced and the following frame is unvoiced, further bands within the current frame are automatically classified as weakly voiced.
21. A system according to claim 19 or 20, wherein the strongly/weakly voiced classification is determined using a majority decision rule on the strongly/weakly voiced classification of those harmonics which fall within the band in question.
22. A system according to claim 21, wherein, if there is no majority, alternate bands are alternately assigned strongly voiced and weakly voiced classifications.
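The band voting of claims 17, 21 and 22 reduces to a short routine; representing bands as harmonic index ranges is an assumption of the sketch.

```python
def classify_bands(harmonic_flags, band_edges):
    # harmonic_flags: booleans, True = strongly voiced harmonic.
    # band_edges:     (first_harmonic, last_harmonic) index pairs.
    bands, last_tie_break = [], False
    for lo, hi in band_edges:
        votes = harmonic_flags[lo:hi + 1]
        strong = sum(votes)
        weak = len(votes) - strong
        if strong != weak:
            # Majority decision rule (claim 21).
            bands.append(strong > weak)
        else:
            # No majority: alternate the classification (claim 22).
            last_tie_break = not last_tie_break
            bands.append(last_tie_break)
    return bands
```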
23. A speech synthesis system in which a speech signal is divided into a series of frames, each frame is defined as voiced or unvoiced, each frame is converted into a coded signal including a pitch period value, a frame voiced/unvoiced classification and, for each voiced frame, a mixed voiced spectral band classification which classifies harmonics within spectral bands as either strongly or weakly voiced, and the speech signal is reconstructed by generating an excitation signal in respect of each frame and applying the excitation signal to a filter, wherein for each weakly voiced spectral band, an excitation signal is generated which includes a random component in the form of a function which is dependent upon the respective pitch period value.
24. A system according to claim 23, wherein the spectrum is divided into bands and a strongly/weakly voiced classification is assigned to each band.
25. A system according to claim 23 or 24, wherein the random component is introduced by reducing the amplitude of harmonic oscillators assigned the weakly voiced classification, disturbing the oscillator frequencies such that the frequency is no longer a multiple of the fundamental frequency, and then adding further random signals.
26. A system according to claim 25, wherein the phase of the oscillators is randomised.
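The weakly voiced excitation of claims 23 to 26 might be sketched as below; the gain, detuning, and noise constants are assumptions, since the claims only require that such a random, pitch-dependent component exists.

```python
import numpy as np

def mixed_excitation(amps, strong_flags, pitch, n, fs=8000,
                     weak_gain=0.7, jitter=0.02, noise_gain=0.5):
    # Per-harmonic mixed excitation following claims 25 and 26.
    rng = np.random.default_rng()
    t = np.arange(n) / fs
    f0 = fs / pitch                    # fundamental from the pitch period
    x = np.zeros(n)
    weak_energy = 0.0
    for k, (a, strong) in enumerate(zip(amps, strong_flags), start=1):
        f, phase, amp = k * f0, 0.0, a
        if not strong:
            amp *= weak_gain                           # reduce amplitude
            f *= 1.0 + jitter * (2 * rng.random() - 1) # detune off-harmonic
            phase = rng.uniform(0.0, 2 * np.pi)        # randomise phase
            weak_energy += (a - amp) ** 2
        x += amp * np.cos(2 * np.pi * f * t + phase)
    # Further random signal covering the weakly voiced bands (claim 25).
    x += noise_gain * np.sqrt(weak_energy) * rng.standard_normal(n)
    return x
```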
27. A speech synthesis system in which a speech signal is divided into a series of frames, and each voiced frame is converted into a coded signal including a pitch period value, LPC coefficients and pitch segment spectral magnitude information, wherein the spectral magnitude information is quantized by sampling the LPC short term magnitude spectrum at harmonic frequencies, the locations of the largest spectral samples are determined to identify which of the magnitudes are relatively more important for accurate quantization, and the magnitudes so identified are selected and vector quantized.
28. A system according to claim 27, wherein a pitch segment of Pn LPC residual samples is obtained, where Pn is the pitch period value of the nth frame, the pitch segment is DFT transformed, the mean value of the resultant spectral magnitudes is calculated, the mean value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
29. A system according to claim 27, wherein the RMS value of the pitch segment is calculated, the RMS value is quantized and used as a normalisation factor for the selected magnitudes, and the resulting normalised amplitudes are quantized.
30. A system according to any one of claims 27 to 29, wherein, at the receiver, the selected magnitudes are recovered, and each of the other magnitude values is reproduced as a constant value.
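Claims 27 to 30 could be illustrated with the following sketch, in which the LPC envelope sampled at harmonic frequencies selects the magnitudes to quantize; the number of selected magnitudes and the uniform 8-bit mean quantiser are assumptions, and one magnitude per harmonic is assumed to align with the envelope samples.

```python
import numpy as np

def select_and_normalise_magnitudes(residual_segment, lpc_env, n_select=8):
    # Residual spectral magnitudes, one per harmonic (DC bin skipped).
    mags = np.abs(np.fft.rfft(residual_segment))[1:len(lpc_env) + 1]
    # Largest LPC envelope samples mark the important magnitudes (claim 27).
    idx = np.argsort(lpc_env[:len(mags)])[-n_select:]
    # Mean magnitude, quantised, used as a normalisation factor (claim 28).
    q_mean = max(np.round(mags.mean() * 255) / 255, 1e-6)
    return idx, mags[idx] / q_mean, q_mean

def reconstruct_magnitudes(n_harm, idx, normalised, q_mean):
    # Receiver side (claim 30): unselected magnitudes are reproduced as
    # a constant value, here the quantised mean (cf. claim 35).
    mags = np.full(n_harm, q_mean)
    mags[idx] = normalised * q_mean
    return mags
```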
31. A speech synthesis system in which a variable size input vector of coefficients to be transmitted to a receiver for the reconstruction of a speech signal is vector quantized using a codebook defined by vectors of fixed size, the codebook vectors of fixed size are obtained from variable sized training vectors and an interpolation technique which is an integral part of the codebook generation process, codebook vectors are compared to the variable sized input vector using the interpolation process, and an index associated with the codebook entry with the smallest difference from the comparison is transmitted, the index being used to address a further codebook at the receiver and thereby derive an associated fixed size codebook vector, and the interpolation process being used to recover from the derived fixed sized codebook vector an approximation of the variable sized input vector.
32. A system according to claim 31, wherein the interpolation process is linear, and for an input vector of given dimension, the interpolation process is applied to produce from the codebook vectors a set of vectors of that given dimension, a distortion measure is then derived to compare the interpolated set of vectors and the input vector, and the codebook vector is selected which yields the minimum distortion.
33. A system according to claim 32, wherein the dimension of the vectors is reduced by taking into account only the harmonic amplitudes within an input bandwidth range.
34. A system according to claim 33, wherein the remaining amplitudes are set to a constant value.
35. A system according to claim 34, wherein the constant value is equal to the mean value of the quantized amplitudes.
36. A system according to any one of claims 31 to 35, wherein redundancy between amplitude vectors obtained from adjacent residual frames is removed by means of backward prediction.
37. A system according to claim 36, wherein the backward prediction is performed on a harmonic basis such that the amplitude value of each harmonic of one frame is predicted from the amplitude value of the same harmonic in the previous frame or frames.
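The variable-dimension vector quantization of claims 31 and 32 lends itself to a compact sketch; linear interpolation is used as stated in claim 32, while the squared-error distortion measure is an assumption.

```python
import numpy as np

def interpolate(vec, dim):
    # Linearly resample a fixed-size codebook vector to 'dim' points.
    x_old = np.linspace(0.0, 1.0, len(vec))
    x_new = np.linspace(0.0, 1.0, dim)
    return np.interp(x_new, x_old, vec)

def vq_encode(input_vec, codebook):
    # Compare every fixed-size codebook vector to the variable-size
    # input after interpolation and return the index with minimum
    # distortion (claims 31 and 32).
    dim = len(input_vec)
    errs = [np.sum((interpolate(cv, dim) - input_vec) ** 2)
            for cv in codebook]
    return int(np.argmin(errs))

def vq_decode(index, codebook, dim):
    # Receiver: recover an approximation of the variable-size vector.
    return interpolate(codebook[index], dim)
```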
38. A speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including an estimated pitch period, an estimate of the energy of a speech segment whose duration is a function of the estimated pitch period, and LPC filter coefficients defining an LPC spectral envelope, and a speech signal whose power is related to that of the input speech signal is reconstructed by generating an excitation signal using spectral amplitudes which are defined from a modified LPC spectral envelope sampled at harmonic frequencies defined by the pitch period.
39. A system according to claim 38, wherein the magnitude values are obtained by spectrally sampling a modified LPC synthesis filter characteristic at the harmonic locations related to the pitch period.
40. A system according to claim 39, wherein the modified LPC synthesis filter has reduced feedback gain and a frequency response which consists of equalised resonant peaks, the locations of which are close to the LPC synthesis resonant locations.
41. A system according to claim 40, wherein the value of the feedback gain is controlled by the performance of the LPC model such that it is related to the normalised LPC prediction error.
42. A system according to any one of claims 38 to 41, wherein the energy of the reproduced speech signal is equal to the energy of the original speech waveform.
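A sketch of the modified envelope sampling of claims 38 to 41, assuming the common bandwidth-expansion form 1/A(z/gamma) for the reduced feedback gain filter and a fixed gamma rather than one derived from the normalised LPC prediction error:

```python
import numpy as np

def modified_lpc_envelope(lpc, pitch, fs=8000, gamma=0.85):
    # lpc holds a_1..a_p with the convention A(z) = 1 - sum a_k z^{-k}.
    # Substituting z -> z/gamma scales a_k by gamma^k, which pulls the
    # poles inward and equalises the resonant peaks (claim 40).
    lpc = np.asarray(lpc, dtype=float)
    p = np.arange(1, len(lpc) + 1)
    a_mod = lpc * gamma ** p

    # Harmonic frequencies defined by the pitch period (claim 39).
    f0 = fs / pitch
    freqs = np.arange(1, int((fs / 2) / f0) + 1) * f0
    w = 2 * np.pi * freqs / fs

    # |1 / A(e^{jw}/gamma)| sampled at the harmonic locations.
    denom = 1.0 - np.exp(-1j * np.outer(w, p)) @ a_mod
    return np.abs(1.0 / denom)
```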
43. A speech synthesis system in which a speech signal is divided into a series of frames, each frame is converted into a coded signal including LPC filter coefficients and at least one parameter associated with a pitch segment magnitude, and the speech signal is reconstructed by generating two excitation signals in respect of each frame, each pair of excitation signals comprising a first excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of one frame and a second excitation signal generated on the basis of the pitch segment magnitude parameter or parameters of a second frame which follows and is adjacent to the said one frame, applying the first excitation signal to a first LPC filter the characteristics of which are determined by the LPC filter coefficients of the said one frame and applying the second excitation signal to a second LPC filter the characteristics of which are determined by the LPC filter coefficients of the said second frame, and weighting and combining the outputs of the first and second LPC filters to produce one frame of a synthesised speech signal.
44. A system according to claim 43, wherein the first and second excitation signals include the same phase function and different phase contributions from the two LPC filters.
45. A system according to claim 44, wherein the outputs of the first and second LPC filters are weighted by half a window function such that the magnitude of the output of the first filter is decreasing with time and the magnitude of the output of the second filter is increasing with time.
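The dual-filter synthesis of claims 43 to 45 amounts to a cross-faded overlap-add, sketched below with a triangular fade standing in for the half window function of claim 45; scipy's lfilter supplies the LPC synthesis filters.

```python
import numpy as np
from scipy.signal import lfilter

def synthesise_frame(exc1, exc2, lpc1, lpc2):
    # exc1/lpc1 belong to the current frame, exc2/lpc2 to the following
    # frame (claim 43).  Coefficients are a_1..a_p with the convention
    # A(z) = 1 - sum a_k z^{-k}, so 1/A(z) has denominator [1, -a_k].
    n = len(exc1)
    y1 = lfilter([1.0], np.concatenate(([1.0], -np.asarray(lpc1))), exc1)
    y2 = lfilter([1.0], np.concatenate(([1.0], -np.asarray(lpc2))), exc2)

    # Weight by the two halves of a window: the first filter's output
    # decreases with time while the second's increases (claim 45).
    fade_out = np.linspace(1.0, 0.0, n)
    return fade_out * y1 + (1.0 - fade_out) * y2
```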
46. A speech coding system which operates on a frame by frame basis, and in which information is transmitted which represents each frame as either voiced or unvoiced and, for each voiced frame, represents that frame by a pitch period value, quantized magnitude spectral information, and LPC filter coefficients, the received pitch period value and magnitude spectral information being used to generate residual signals at the receiver which are applied to LPC speech synthesis filters the characteristics of which are determined by the transmitted filter coefficients, wherein each residual signal is synthesised according to a sinusoidal mixed excitation synthesis process, and a recovered speech signal is derived from the residual signals.
47. A speech synthesis system substantially as hereinbefore described with reference to the accompanying drawings.
CA002259374A 1996-07-05 1997-07-07 Speech synthesis system Abandoned CA2259374A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB9614209.6 1996-07-05
GBGB9614209.6A GB9614209D0 (en) 1996-07-05 1996-07-05 Speech synthesis system
US2181596P 1996-07-16 1996-07-16
US021,815 1996-07-16

Publications (1)

Publication Number Publication Date
CA2259374A1 true CA2259374A1 (en) 1998-01-15

Family

ID=26309651

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002259374A Abandoned CA2259374A1 (en) 1996-07-05 1997-07-07 Speech synthesis system

Country Status (7)

Country Link
EP (1) EP0950238B1 (en)
JP (1) JP2000514207A (en)
AT (1) ATE249672T1 (en)
AU (1) AU3452397A (en)
CA (1) CA2259374A1 (en)
DE (1) DE69724819D1 (en)
WO (1) WO1998001848A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2784218B1 (en) * 1998-10-06 2000-12-08 Thomson Csf LOW-SPEED SPEECH CODING METHOD
GB2357683A (en) * 1999-12-24 2001-06-27 Nokia Mobile Phones Ltd Voiced/unvoiced determination for speech coding
GB2398981B (en) * 2003-02-27 2005-09-14 Motorola Inc Speech communication unit and method for synthesising speech therein
DE102004007184B3 (en) 2004-02-13 2005-09-22 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method and apparatus for quantizing an information signal
DE102004007191B3 (en) 2004-02-13 2005-09-01 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio coding
DE102004007200B3 (en) 2004-02-13 2005-08-11 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Device for audio encoding has device for using filter to obtain scaled, filtered audio value, device for quantizing it to obtain block of quantized, scaled, filtered audio values and device for including information in coded signal
CN114519996B (en) * 2022-04-20 2022-07-08 北京远鉴信息技术有限公司 Method, device and equipment for determining voice synthesis type and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2670313A1 (en) * 1990-12-11 1992-06-12 Thomson Csf METHOD AND DEVICE FOR EVALUATING THE PERIODICITY AND VOICE SIGNAL VOICE IN VOCODERS AT VERY LOW SPEED.
JP3093113B2 (en) * 1994-09-21 2000-10-03 日本アイ・ビー・エム株式会社 Speech synthesis method and system
KR19980702608A (en) * 1995-03-07 1998-08-05 에버쉐드마이클 Speech synthesizer

Also Published As

Publication number Publication date
AU3452397A (en) 1998-02-02
EP0950238A1 (en) 1999-10-20
JP2000514207A (en) 2000-10-24
DE69724819D1 (en) 2003-10-16
WO1998001848A1 (en) 1998-01-15
ATE249672T1 (en) 2003-09-15
EP0950238B1 (en) 2003-09-10

Legal Events

Date Code Title Description
EEER Examination request
FZDE Discontinued