CA2219358A1 - Speech signal quantization using human auditory models in predictive coding systems - Google Patents

Speech signal quantization using human auditory models in predictive coding systems

Info

Publication number
CA2219358A1
Authority
CA
Canada
Prior art keywords
signal
speech
pitch
lpc
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA 2219358
Other languages
French (fr)
Inventor
Juin-Hwey Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
At&T Corp.
Juin-Hwey Chen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp. and Juin-Hwey Chen
Publication of CA2219358A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/002 Dynamic bit allocation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using orthogonal transformation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/12 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A speech compression system called "Transform Predictive Coding", or TPC, provides encoding for 7 kHz band speech at 16 kHz sampling at a target bit-rate of 16 or 32 kb/s (one or two bits per sample). The system uses short-term and long-term prediction to remove redundancy. The prediction residual is transformed and coded in the frequency domain, as shown on the figure by (110), after accepting time domain data from (60) and parameter input from (100), which corrects the spectrum for auditory perception. The TPC coder uses only open-loop quantization, as shown by (70), and therefore has low complexity. The speech quality is transparent at 32 kb/s, very good at 24 kb/s, and acceptable at 16 kb/s.

Description

CA 02219358 1997-10-24 WO 97/31367 PCT/US97/02898 RESIDUAL SIGNALS WITH QUANTIZATION BY AUDITORY MODELS

Field of the Invention
The present invention relates to the compression (coding) of audio signals, for example, speech signals, using a predictive coding system.

Background of the Invention
As taught in the literature of signal compression, speech and music waveforms are coded by very different coding techniques. Speech coding, such as telephone-bandwidth (3.4 kHz) speech coding at or below 16 kb/s, has been dominated by time-domain predictive coders. These coders use speech production models to predict speech waveforms to be coded. Predicted waveforms are then subtracted from the actual (original) waveforms (to be coded) to reduce redundancy in the original signal. Reduction in signal redundancy provides coding gain. Examples of such predictive speech coders include Adaptive Predictive Coding, Multi-Pulse Linear Predictive Coding, and Code-Excited Linear Prediction (CELP) Coding, all well known in the art of speech signal compression.
On the other hand, wideband (0 - 20 kHz) music coding at or above 64 kb/s has been dominated by frequency-domain transform or sub-band coders. These music coders are fundamentally very different from the speech coders discussed above. This difference is due to the fact that the sources of music, unlike those of speech, are too varied to allow ready prediction. Consequently, models of music sources are generally not used in music coding. Instead, music coders use elaborate human hearing models to code only those parts of the signal that are perceptually relevant. That is, unlike speech coders, which commonly use speech production models, music coders employ hearing -- sound reception -- models to obtain coding gain.
In music coders, hearing models are used to determine a noise masking capability of the music to be coded. The term "noise masking capability" refers to how much quantization noise can be introduced into a music signal without a listener noticing the noise. This noise masking capability is then used to set quantizer resolution (e.g., quantizer stepsize). Generally, the more "tonelike" music is, the poorer the music will be at masking quantization noise and, therefore, the smaller the required stepsize will be, and vice versa. Smaller stepsizes correspond to smaller coding gains, and vice versa. Examples of such music coders include AT&T's Perceptual Audio Coder (PAC) and the ISO MPEG audio coding standard.
In between telephone-bandwidth speech coding and wideband music coding, there lies wideband speech coding, where the speech signal is sampled at 16 kHz and has a bandwidth of 7 kHz. The advantage of 7 kHz wideband speech is that the resulting speech quality is much better than telephone-bandwidth speech, and yet it requires a much lower bit-rate to code than a 20 kHz audio signal. Among those previously proposed wideband speech coders, some use time-domain predictive coding, some use frequency-domain transform or sub-band coding, and some use a mixture of time-domain and frequency-domain techniques.
The inclusion of perceptual criteria in predictive speech coding, wideband or otherwise, has been limited to the use of a perceptual weighting filter in the context of selecting the best synthesized speech signal from among a plurality of candidate synthesized speech signals. See, e.g., U.S. Patent No. Re. 32,580 to Atal et al. Such filters accomplish a type of noise shaping which is useful in reducing noise in the coding process. One known coder attempts to improve upon this technique by employing a perceptual model in the formation of that perceptual weighting filter.

Summary of the Invention
The efforts described above notwithstanding, none of the known speech or audio coders utilizes both a speech production model for signal prediction purposes and a hearing model to set quantizer resolution according to an analysis of signal noise masking capability.
The present invention, on the other hand, combines a predictive coding system with a quantization process which quantizes a signal based on a noise masking signal determined with a model of human auditory sensitivity to noise. The output of the predictive coding system is thus quantized with a quantizer having a resolution (e.g., stepsize in a uniform scalar quantizer, or the number of bits used to identify vectors in a vector quantizer) which is a function of a noise masking signal determined in accordance with an audio perceptual model.
According to the invention, a signal is generated which represents an estimate (or prediction) of a signal representing speech information. The term "original signal representing speech information" is broad enough to refer not only to speech itself, but also to speech signal derivatives commonly found in speech coding systems (such as linear prediction and pitch prediction residual signals). The estimate signal is then compared to the original signal to form a signal representing the difference between said compared signals. This signal representing the difference between the compared signals is then quantized in accordance with a perceptual noise masking signal which is generated by a model of human audio perception.
An illustrative embodiment of the present invention, referred to as "Transform Predictive Coding", or TPC, encodes 7 kHz wideband speech at a target bit-rate of 16 to 32 kb/s. As its name implies, TPC combines transform coding and predictive coding techniques in a single coder. More specifically, the coder uses linear prediction to remove the redundancy from the input speech waveform and then uses transform coding techniques to encode the resulting prediction residual. The transformed prediction residual is quantized based on knowledge of human auditory perception, expressed in terms of an auditory perceptual model, to encode what is audible and discard what is inaudible.
One important feature of the illustrative embodiment concerns the way in which perceptual noise masking capability (e.g., the perceptual threshold of "just noticeable distortion") of the signal is determined and subsequent bit allocation is performed. Rather than determining a perceptual threshold using the unquantized input signal, as is done in conventional music coders, the noise masking threshold and bit allocation of the embodiment are determined based on the frequency response of a quantized synthesis filter (in the embodiment, a quantized LPC synthesis filter). This feature provides an advantage to the system of not having to communicate bit allocation signals, from the encoder to the decoder, in order for the decoder to replicate the perceptual threshold and bit allocation processing needed for decoding the received coded wideband speech information. Instead, synthesis filter coefficients, which are being communicated for other purposes, are exploited to save bit rate.
Another important feature of the illustrative embodiment concerns how the TPC coder allocates bits among coder frequencies and how the decoder generates a quantized output signal based on the allocated bits. In certain circumstances, the TPC coder allocates bits only to a portion of the audio band (for example, bits may be allocated to coefficients between 0 and 4 kHz only). No bits are allocated to represent coefficients between 4 kHz and 7 kHz and, thus, the decoder gets no coefficients in this frequency range. Such a circumstance occurs when, for example, the TPC coder has to operate at very low bit rates, e.g., 16 kb/s. Despite having no bits representing the coded signal in the 4 kHz to 7 kHz frequency range, the decoder must still synthesize a signal in this range if it is to provide a wideband response. According to this feature of the embodiment, the decoder generates - that is, synthesizes - coefficient signals in this range of frequencies based on other available information - a ratio of an estimate of the signal spectrum (obtained from LPC parameters) to a noise masking threshold at frequencies in the range. Phase values for the coefficients are selected at random. By virtue of this technique, the decoder can provide a wideband response without the need to transmit speech signal coefficients for the entire band.
The potential applications of a wideband speech coder include ISDN video-conferencing or audio-conferencing, multimedia audio, "hi-fi" telephony, and simultaneous voice and data (SVD) over dial-up lines using modems at 28.8 kb/s or higher.

Brief Description of the Drawings
Figure 1 presents an illustrative coder embodiment of the present invention.
Figure 2 presents an illustrative decoder embodiment of the present invention.
Figure 3 presents a detailed block diagram of the LPC parameter processor of Figure 1.

Detailed Description
A. Overview of the Illustrative Embodiments
For clarity of explanation, the illustrative embodiment of the present invention is presented as comprising individual functional blocks (including functional blocks labeled as "processors"). The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. For example, the functions of processors presented in Figures 1 to 4 may be provided by a single shared processor. (Use of the term "processor" should not be construed to refer exclusively to hardware capable of executing software.) Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, read-only memory (ROM) for storing software performing the operations discussed below, and random access memory (RAM) for storing DSP results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with a general purpose DSP circuit, may also be provided.
In accordance with the present invention, the sequence of digital input speech samples is partitioned into consecutive 20 ms blocks called frames, and each frame is further subdivided into 5 equal subframes of 4 ms each. Assuming a sampling rate of 16 kHz, as is common for wideband speech signals, this corresponds to a frame size of 320 samples and a subframe size of 64 samples. The TPC speech coder buffers and processes the input speech signal frame-by-frame, and within each frame certain encoding operations are performed subframe-by-subframe.
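
The frame and subframe sizes stated above follow directly from the sampling rate. A minimal sketch of that arithmetic and of the partitioning, assuming NumPy; the names FS, frame_len, subframe_len and split_frames are hypothetical and chosen only for illustration:

```python
import numpy as np

FS = 16000                 # sampling rate in Hz (from the text)
FRAME_MS = 20              # frame duration in ms (from the text)
SUBFRAMES_PER_FRAME = 5    # subframes per frame (from the text)

frame_len = FS * FRAME_MS // 1000                 # samples per 20 ms frame
subframe_len = frame_len // SUBFRAMES_PER_FRAME   # samples per 4 ms subframe


def split_frames(s):
    """Partition samples into an array of shape (n_frames, 5, 64).

    Assumption for this sketch: a trailing partial frame is dropped.
    """
    s = np.asarray(s)
    n_frames = len(s) // frame_len
    usable = s[: n_frames * frame_len]
    return usable.reshape(n_frames, SUBFRAMES_PER_FRAME, subframe_len)
```

For example, 700 input samples yield two complete frames, each viewed as 5 subframes of 64 samples.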
Figure 1 presents an illustrative TPC speech coder embodiment of the present invention. Once every 20 ms frame, the LPC parameter processor 10 derives the Line Spectral Pair (LSP) parameters from the input speech signal s, quantizes such LSP parameters, interpolates them for each 4 ms subframe, and then converts them to the LPC predictor coefficient array a for each subframe. Short-term redundancy is removed from the input speech signal, s, by the LPC prediction error filter 20. The resulting LPC prediction residual signal, d, still has some long-term redundancy due to the pitch periodicity in voiced speech. The shaping filter coefficient processor 30 derives the shaping filter coefficients awc from the quantized LPC filter coefficients a. The shaping filter 40 filters the LPC prediction residual signal d to produce a perceptually weighted speech signal sw. The zero-input response processor 50 calculates the zero-input response, zir, of the shaping filter. The subtracting unit 60 then subtracts zir from sw to obtain tp, the target signal for pitch prediction.
The open-loop pitch extractor and interpolator 70 uses the LPC prediction residual d to extract a pitch period for each 20 ms frame, and then calculates the interpolated pitch period kpi for each 4 ms subframe. The closed-loop pitch tap quantizer and pitch predictor 80 uses this interpolated pitch period kpi to select a set of 3 pitch predictor taps from a codebook of candidate sets of pitch taps. The selection is done such that when the previously quantized LPC residual signal dt is filtered by the corresponding 3-tap pitch synthesis filter and then by a shaping filter with zero initial memory, the output signal hd is closest to the target signal tp in a mean-square error (MSE) sense. The subtracting unit 90 subtracts hd from tp to obtain tt, the target signal for transform coding.
The shaping filter magnitude response processor 100 calculates the signal mag, the magnitude of the frequency response of the shaping filter.
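
One way to picture the computation of mag is sketched below, assuming NumPy, and assuming the shaping filter is the all-pole filter 1/Awc(z) whose coefficient array awc has a leading 1. The function name and the transform size n_fft are hypothetical; the text above does not specify how processor 100 samples the response.

```python
import numpy as np


def shaping_filter_magnitude(awc, n_fft=64):
    """Magnitude response of the all-pole filter 1/Awc(z) on n_fft//2 + 1 bins.

    awc   : denominator coefficients, awc[0] == 1 (assumed convention)
    n_fft : hypothetical transform size; awc is zero-padded to this length
    """
    awc = np.asarray(awc, dtype=float)
    denom = np.abs(np.fft.rfft(awc, n_fft))   # |Awc(e^{jw})| on a uniform grid
    return 1.0 / denom                        # all-pole response is the reciprocal
```

For a first-order example awc = [1, -0.5], the response is 1/|1 - 0.5| = 2 at zero frequency and 1/|1 + 0.5| at the Nyquist frequency.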
The transform processor 110 performs a linear transform, such as a Fast Fourier Transform (FFT), on the signal tt. Then, it normalizes the transform coefficients using mag and the quantized versions of gain values, which are calculated over three different frequency bands. The result is the normalized transform coefficient signal tc. The transform coefficient quantizer 120 then quantizes the signal tc using the adaptive bit allocation signal ba, which is determined by the hearing model quantizer control processor 130 according to the time-varying perceptual importance of transform coefficients at different frequencies.
At a lower bit-rate, such as 16 kb/s, processor 130 only allocates bits to the lower half of the frequency band (0 to 4 kHz). In this case, the high-frequency synthesis processor 140 synthesizes the transform coefficients in the high-frequency band (4 to 8 kHz), and combines them with the quantized low-frequency transform coefficient signal dtc to produce the final quantized full-band transform coefficient signal qtc. At a higher bit-rate, such as 24 or 32 kb/s, each transform coefficient in the entire frequency band is allowed to receive bits in the adaptive bit allocation process, although coefficients may eventually receive no bits at all due to the scarcity of the available bits. In this case, the high-frequency synthesis processor 140 simply detects those frequencies in the 4 to 8 kHz band that receive no bits, and fills in such "spectral holes" with low-level noise to avoid a type of "swirling" distortion typically found in adaptive transform coders.
The inverse transform processor 150 takes the quantized transform coefficient signal qtc, and applies a linear transform which is the inverse operation of the linear transform employed in the transform processor 110 (an inverse FFT in our particular illustrative embodiment here). This results in a time-domain signal qtt, which is the quantized version of tt, the target signal for transform coding. The inverse shaping filter 160 then filters qtt to obtain the quantized excitation signal et. The adder 170 adds et to the signal dh (which is the pitch-predicted version of the LPC prediction residual d) produced by the pitch predictor inside block 80. The resulting signal dt is the quantized version of the LPC prediction residual d. It is used to update the filter memory of the shaping filter inside the zero-input response processor 50 and the memory of the pitch predictor inside block 80. This completes the signal loop.
Codebook indices representing the LPC predictor parameters (IL), the pitch predictor parameters (IP and IT), the transform gain levels (IG), and the quantized transform coefficients (IC) are multiplexed into a bit stream by the multiplexer 180 and transmitted over a channel to a decoder. The channel may comprise any suitable communication channel, including wireless channels, computer and data networks, and telephone networks, and may include or consist of memory, such as solid state memories (for example, semiconductor memory), optical memory systems (such as CD-ROM), magnetic memories (for example, disk memory), etc.
Figure 2 presents an illustrative TPC speech decoder embodiment of the present invention. The demultiplexer 200 separates the codebook indices IL, IP, IT, IG, and IC. The pitch decoder and interpolator 205 decodes IP and calculates the interpolated pitch period kpi. The pitch tap decoder and pitch predictor 210 decodes IT to obtain the pitch predictor taps array b, and it also calculates the signal dh, or the pitch-predicted version of the LPC prediction residual d. The LPC parameter decoder and interpolator 215 decodes IL and then calculates the interpolated LPC filter coefficient array a. Blocks 220 through 255 perform exactly the same operations as their counterparts in Figure 1 to produce the quantized LPC residual signal dt. The long-term postfilter 260 enhances the pitch periodicity in dt and produces a filtered version fdt as its output. This signal is passed through the LPC synthesis filter 265, and the resulting signal st is further filtered by the short-term postfilter 270, which produces a final filtered output speech signal fst.
To keep the complexity low, open-loop quantization is employed by the TPC as much as possible. Open-loop quantization means the quantizer attempts to minimize the difference between the unquantized parameter and its quantized version, without regard to the effects on the output speech quality. This is in contrast to, for example, CELP coders, where the pitch predictor, the gain, and the excitation are usually closed-loop quantized. In closed-loop quantization of a coder parameter, the quantizer codebook search attempts to minimize the distortion in the final reconstructed output speech. Naturally, this generally leads to a better output speech quality, but at the price of a higher codebook search complexity.
In the present invention, the TPC coder uses closed-loop quantization only for the 3 pitch predictor taps. The quantization operations leading to the quantized excitation signal et are basically similar to open-loop quantization, but the effects on the output speech are close to those of closed-loop quantization. This approach is similar in spirit to the approach used in the TCX coder by Lefebvre et al., "High Quality Coding of Wideband Audio Signals Using Transform Coded Excitation (TCX)," Proc. IEEE International Conf. Acoustics, Speech, Signal Processing, 1994, pp. I-193 to I-196, although there are also important differences. For example, the features of the current invention that are not in the TCX coder include normalization of the transform coefficients by a shaping filter magnitude response, adaptive bit allocation controlled by a hearing model, and the high-frequency synthesis and noise fill-in procedures.

B. An Illustrative Coder Embodiment
1. The LPC Analysis and Prediction
A detailed block diagram of LPC parameter processor 10 is presented in Figure 3. Processor 10 comprises a windowing and autocorrelation processor 310; a spectral smoothing and white noise correction processor 315; a Levinson-Durbin recursion processor 320; a bandwidth expansion processor 325; an LPC to LSP conversion processor 330; an LPC power spectrum processor 335; an LSP quantizer 340; an LSP sorting processor 345; an LSP interpolation processor 350; and an LSP to LPC conversion processor 355.
Windowing and autocorrelation processor 310 begins the process of LPC coefficient generation. Processor 310 generates autocorrelation coefficients, r, in conventional fashion, once every 20 ms, from which LPC coefficients are subsequently computed, as discussed below. See Rabiner, L. R. et al., Digital Processing of Speech Signals, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978 (Rabiner et al.). The LPC frame size is 20 ms (or 320 speech samples at a 16 kHz sampling rate). Each 20 ms frame is further divided into 5 subframes, each 4 ms (or 64 samples) long. The LPC analysis processor uses a 24 ms Hamming window which is centered at the last 4 ms subframe of the current frame, in conventional fashion.
To alleviate potential ill-conditioning, certain conventional signal conditioning techniques are employed. A spectral smoothing technique (SST) and a white noise correction technique are applied by spectral smoothing and white noise correction processor 315 before LPC analysis. The SST, well known in the art (Tohkura, Y. et al., "Spectral Smoothing Technique in PARCOR Speech Analysis-Synthesis," IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26:587-596, December 1978 (Tohkura et al.)), involves multiplying the calculated autocorrelation coefficient array (from processor 310) by a Gaussian window whose Fourier transform corresponds to the probability density function (pdf) of a Gaussian distribution with a standard deviation of 40 Hz. The white noise correction, also conventional (Chen, J.-H., "A Robust Low-Delay CELP Speech Coder at 16 kb/s," Proc. IEEE Global Comm. Conf., pp. 1237-1241, Dallas, TX, November 1989), increases the zero-lag autocorrelation coefficient (i.e., the energy term) by 0.001%.
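
The two conditioning steps can be sketched as follows, assuming NumPy. A Gaussian whose Fourier transform has a standard deviation of sigma Hz corresponds, in the lag domain, to the window exp(-0.5 (2 pi sigma k / fs)^2) at lag k. The function name and argument names are hypothetical and are not taken from the patent.

```python
import numpy as np


def condition_autocorrelation(r, fs=16000.0, sigma_hz=40.0, wnc=1e-5):
    """Apply a Gaussian lag window and white noise correction to r[0..p].

    sigma_hz : standard deviation of the Gaussian in the frequency domain (40 Hz)
    wnc      : white noise correction, 0.001% expressed as a fraction (1e-5)
    """
    r = np.asarray(r, dtype=float).copy()
    k = np.arange(len(r))                                   # lag index in samples
    lag_window = np.exp(-0.5 * (2.0 * np.pi * sigma_hz * k / fs) ** 2)
    r *= lag_window                                         # spectral smoothing
    r[0] *= 1.0 + wnc                                       # raise the energy term
    return r
```

The lag window leaves r[0] unchanged and attenuates higher lags progressively, which is what smooths sharp spectral peaks; the correction then adds a small noise floor.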
The coefficients generated by processor 315 are then provided to Levinson-Durbin recursion processor 320, which generates 16 LPC coefficients, a_i, for i = 1, 2, ..., 16 (the order of the LPC prediction error filter 20 is 16) in conventional fashion.
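
A minimal sketch of a textbook Levinson-Durbin recursion is given below, assuming NumPy. The sign convention is an assumption of this sketch: the returned array has a[0] = 1 and describes the prediction error filter A(z) = sum_i a[i] z^-i. The function name is hypothetical.

```python
import numpy as np


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for an order-`order` predictor.

    Returns (a, err): a[0] == 1 and a[1:] are the error-filter coefficients;
    err is the final prediction error energy.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient from the current prediction of r[m]
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err
        # Order update: a_new[i] = a[i] + k * a[m - i] for i = 1..m-1
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err
```

For the 16th-order analysis described above, the call would be levinson_durbin(r, 16) with r holding lags 0 through 16.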
Bandwidth expansion processor 325 multiplies each a_i by a factor g^i, where g = 0.994, for further signal conditioning. This corresponds to a bandwidth expansion of 30 Hz (Tohkura et al.).
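
The 30 Hz figure can be checked with the usual relation between pole radius and bandwidth: scaling the i-th coefficient by g^i moves every pole radially by the factor g, which widens each resonance by about -(fs/pi) ln(g) Hz. A short sketch, assuming NumPy and hypothetical function names:

```python
import math
import numpy as np


def bandwidth_expand(a, g):
    """Scale coefficient a[i] by g**i (a[0] is left unchanged since g**0 == 1)."""
    a = np.asarray(a, dtype=float)
    return a * g ** np.arange(len(a))


def expansion_hz(g, fs=16000.0):
    """Approximate bandwidth increase in Hz caused by the factor g."""
    return -fs / math.pi * math.log(g)
```

With g = 0.994 and fs = 16000, expansion_hz gives a value between 30 and 31 Hz, consistent with the figure quoted in the text.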
After such a bandwidth expansion, the LPC predictor coefficients are converted to the Line Spectral Pair (LSP) coefficients by LPC to LSP conversion processor 330 in conventional fashion. See Soong, F. K. et al., "Line Spectrum Pair (LSP) and Speech Data Compression," Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1.10.1-1.10.4, March 1984 (Soong et al.), which is incorporated by reference as if set forth fully herein.
Vector quantization (VQ) is then provided by LSP quantizer 340 to quantize the resulting LSP coefficients. The specific VQ technique employed by processor 340 is similar to the split VQ proposed in Paliwal, K. K. et al., "Efficient Vector Quantization of LPC Parameters at 24 bits/frame," Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 661-664, Toronto, Canada, May 1991 (Paliwal et al.), which is incorporated by reference as if set forth fully herein. The 16-dimensional LSP vector is split into 7 smaller sub-vectors having the dimensions of 2, 2, 2, 2, 2, 3, 3, counting from the low-frequency end. Each of the 7 sub-vectors is quantized to 7 bits (i.e., using a VQ codebook of 128 codevectors). Thus, there are seven codebook indices, IL(1) - IL(7), each index being seven bits in length, for a total of 49 bits per frame used in LPC parameter quantization. These 49 bits are provided to the multiplexer 180 for transmission to the decoder as side information.
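
The split structure and the bit count can be sketched as below, assuming NumPy. The codebooks here are random placeholders, not trained codebooks, and the function names and the optional per-component weights argument are hypothetical; the search is a plain exhaustive nearest-neighbour search within each split.

```python
import numpy as np

SPLIT_DIMS = (2, 2, 2, 2, 2, 3, 3)   # sub-vector dimensions, low frequency first
BITS_PER_SPLIT = 7                   # 128 codevectors per split

total_bits = BITS_PER_SPLIT * len(SPLIT_DIMS)

# Placeholder codebooks for illustration only (random values, not trained)
rng = np.random.default_rng(0)
codebooks = [rng.uniform(0.0, np.pi, size=(2 ** BITS_PER_SPLIT, d))
             for d in SPLIT_DIMS]


def split_vq_encode(lsp, codebooks, weights=None):
    """Return one codebook index per split, minimizing (weighted) squared error."""
    lsp = np.asarray(lsp, dtype=float)
    w_all = np.ones_like(lsp) if weights is None else np.asarray(weights, float)
    indices, start = [], 0
    for dim, cb in zip(SPLIT_DIMS, codebooks):
        x = lsp[start:start + dim]
        w = w_all[start:start + dim]
        dist = np.sum(w * (cb - x) ** 2, axis=1)   # distortion to every codevector
        indices.append(int(np.argmin(dist)))
        start += dim
    return indices


def split_vq_decode(indices, codebooks):
    """Concatenate the selected codevectors back into a 16-dimensional vector."""
    return np.concatenate([cb[i] for i, cb in zip(indices, codebooks)])
```

Splitting keeps the search small: seven searches over 128 entries each, instead of one search over 2^49 joint entries.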
Processor 340 performs its search through the VQ codebook using a conventional weighted mean-square error (WMSE) distortion measure, as described in Paliwal et al. The LPC power spectrum processor 335 is used to calculate the weights in this WMSE distortion measure. The codebook used in processor 340 is designed with conventional codebook generation techniques well known in the art. A conventional MSE distortion measure can also be used instead of the WMSE measure to reduce the coder's complexity without significant degradation in the output speech quality.
Normally LSP coefficients monotonically increase. However, quantization may result in a disruption of this order. This disruption results in an unstable LPC synthesis filter in the decoder. To avoid this problem, the LSP sorting processor 345 sorts the quantized LSP coefficients to restore the monotonically increasing order and ensure stability.
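
The reordering step amounts to an ascending sort of the quantized coefficients. A minimal sketch, assuming NumPy and hypothetical function names:

```python
import numpy as np


def restore_lsp_order(lsp_q):
    """Sort quantized LSP coefficients into ascending order."""
    return np.sort(np.asarray(lsp_q, dtype=float))


def is_monotonic(lsp):
    """True if the coefficients are strictly increasing."""
    return bool(np.all(np.diff(lsp) > 0))
```

A vector such as [0.10, 0.35, 0.30, 0.60], where quantization has swapped two neighbours, fails the check before sorting and passes it afterwards.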
The quantized LSP coefficients are used in the last subframe of the current frame. Linear interpolation between these LSP coefficients and those from the last subframe of the previous frame is performed to provide LSP coefficients for the first four subframes by LSP interpolation processor 350, as is conventional. The interpolated and quantized LSP coefficients are then converted back to the LPC predictor coefficients for use in each subframe by LSP to LPC conversion processor 355 in conventional fashion. This is done in both the encoder and the decoder. The LSP interpolation is important in maintaining the smooth reproduction of the output speech. The LSP interpolation allows the LPC predictor coefficients to be updated once a subframe (4 ms) in a smooth fashion. The resulting LPC predictor coefficient array a is used in the LPC prediction error filter 20 to predict the coder's input signal. The difference between the input signal and its predicted version is the LPC prediction residual, d.
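
The per-subframe interpolation can be sketched as follows, assuming NumPy. The equally spaced weights 0.2, 0.4, 0.6, 0.8, 1.0 are an assumption of this sketch (the text states only that the interpolation is linear and that the last subframe uses the newly quantized LSPs); the function name is hypothetical.

```python
import numpy as np


def interpolate_lsp(prev_lsp, curr_lsp, n_sub=5):
    """Return an (n_sub, order) array of LSP vectors, one per subframe.

    prev_lsp : quantized LSPs of the last subframe of the previous frame
    curr_lsp : quantized LSPs of the last subframe of the current frame
    """
    prev_lsp = np.asarray(prev_lsp, dtype=float)
    curr_lsp = np.asarray(curr_lsp, dtype=float)
    weights = np.arange(1, n_sub + 1) / n_sub      # assumed: 0.2, 0.4, ..., 1.0
    return np.array([(1.0 - w) * prev_lsp + w * curr_lsp for w in weights])
```

Because both endpoints are ordered LSP vectors, every convex combination of them is ordered as well, so the interpolated sets remain valid inputs to the LSP to LPC conversion.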

2. Shaping Filter
The shaping filter coefficient processor 30 computes the first three autocorrelation coefficients of the LPC predictor coefficient array a, then uses the Levinson-Durbin recursion to solve for the coefficients c_j, j = 0, 1, 2, of the corresponding optimal second-order all-pole predictor. These predictor coefficients are then bandwidth-expanded by a factor of 0.7 (i.e., the j-th coefficient c_j is replaced by c_j(0.7)^j). Next, processor 30 also performs bandwidth expansion of the 16th-order all-pole LPC predictor coefficient array a, but this time by a factor of 0.8. Cascading these two bandwidth-expanded all-pole filters (2nd-order and 16th-order) gives the desired 18th-order shaping filter 40. The shaping filter coefficient array awc is calculated by convolving the two bandwidth-expanded coefficient arrays (2nd-order and 16th-order) mentioned above to get a direct-form 18th-order filter.
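
The steps of this computation can be sketched as below, assuming NumPy. The convention that a and c carry a leading 1 (so that the arrays describe polynomials in z^-1) is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np


def levinson_durbin(r, order):
    """Textbook Levinson-Durbin recursion; returns coefficients with a[0] == 1."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k
    return a


def shaping_filter_coeffs(a, g2=0.7, g16=0.8):
    """Build the 18th-order coefficient array awc from a 16th-order array a."""
    a = np.asarray(a, dtype=float)
    # First three autocorrelation coefficients of the coefficient array itself
    r = np.array([np.dot(a[:len(a) - k], a[k:]) for k in range(3)])
    c = levinson_durbin(r, 2)                      # second-order fit, c[0] == 1
    c_bw = c * g2 ** np.arange(len(c))             # c_j * 0.7**j
    a_bw = a * g16 ** np.arange(len(a))            # a_i * 0.8**i
    return np.convolve(c_bw, a_bw)                 # cascade: 2 + 16 = order 18
```

Convolving the two coefficient arrays is equivalent to multiplying the two denominator polynomials, which is why the cascade collapses into a single direct-form filter with 19 coefficients.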
When the shaping filter 40 is cascaded with the LPC prediction error filter, as is shown in Figure 1, the two filters effectively form a perceptual weighting filter whose frequency response is roughly the inverse of the desired coding noise spectrum. Thus, the output of the shaping filter 40 is called the perceptually weighted speech signal sw.
The zero-input response processor 50 has a shaping filter in it. At the beginning of each 4 ms subframe, it performs shaping filtering by feeding the filter with 4 ms worth of zero input signal. In general, the corresponding output signal vector zir is non-zero because the filter generally has non-zero memory (except during the very first subframe after coder initialization, or when the coder's input signal has been exactly zero since the coder starts up). Processor 60 subtracts zir from the weighted speech vector sw; the resulting signal vector tp is the target vector for closed-loop pitch prediction.
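
A zero-input response of an all-pole filter is simply the output of its recursion when the input is held at zero and only the stored past outputs drive it. A minimal sketch, assuming NumPy, a coefficient array awc with awc[0] = 1, and a filter memory held as past output samples (most recent last); the function name and the memory layout are hypothetical.

```python
import numpy as np


def zero_input_response(awc, past_outputs, n=64):
    """Run y[t] = -sum_i awc[i] * y[t - i] for n samples with zero input."""
    awc = np.asarray(awc, dtype=float)
    order = len(awc) - 1
    hist = list(np.asarray(past_outputs, dtype=float)[-order:])
    hist = [0.0] * (order - len(hist)) + hist      # pad short memories with zeros
    zir = np.zeros(n)
    for t in range(n):
        y = -sum(awc[i] * hist[-i] for i in range(1, order + 1))
        zir[t] = y
        hist.append(y)                             # output feeds the recursion
    return zir
```

With all-zero memory the result is identically zero, which matches the exception noted in the text for the first subframe after initialization.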

3. Closed-Loop Pitch Prediction
There are two kinds of parameters in pitch prediction which need to be quantized and transmitted to the decoder: the pitch period, corresponding to the period of the nearly periodic waveform of voiced speech, and the three pitch predictor coefficients (taps).

a. Pitch Period
The pitch period of the LPC prediction residual is determined by the open-loop pitch extractor and interpolator 70 using a modified version of the efficient two-stage search technique discussed in U.S. Patent No. 5,327,520, entitled "Method of Use of Voice Message Coder/Decoder," and incorporated by reference as if set forth fully herein. Processor 70 first passes the LPC residual through a third-order elliptic lowpass filter to limit the bandwidth to about 700 Hz, and then performs 8:1 decimation of the lowpass filter output.

Using a pitch analysis window corresponding to the last 3 subframes of the current frame, the correlation coefficients of the decimated signal are calculated for time lags ranging from 3 to 34, which correspond to time lags of 24 to 272 samples in the undecimated signal domain. Thus, the allowable range for the pitch period is 1.5 ms to 17 ms, or 59 Hz to 667 Hz in terms of the pitch frequency. This is sufficient to cover the normal pitch range of most speakers, including low-pitched males and high-pitched children.
After the correlation coefficients of the decimated signal are calculated, the first major peak of the correlation coefficients which has the lowest time lag is identified. This is the first-stage search. Let the resulting time lag be t.
This value t is multiplied by 8 to obtain the time lag in the undecimated signal domain. The resulting time lag, 8t, points to the neighborhood where the true pitch period is most likely to lie. To retain the original time resolution in the undecimated signal domain, a second-stage pitch search is conducted in the range of t-4 to t+4. The correlation coefficients of the original undecimated LPC residual, d, are calculated for the time lags of t-4 to t+4 (subject to the lower bound of 24 samples and upper bound of 272 samples). The time lag corresponding to the maximum correlation coefficient in this range is then identified as the final pitch period. This pitch period is encoded into 8 bits and the 8-bit index IP is provided to the multiplexer 180 for transmission to the decoder as side information. Eight bits are sufficient to represent the pitch period since there are only 272-24+1=249 possible integers that can be selected as the pitch period.
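The second-stage search and the 8-bit index can be sketched as below. Several points are assumptions rather than statements from the text: the search window is taken to be centered on the undecimated lag, a plain (unnormalized) correlation is used, and the index is formed by subtracting the minimum lag. The lowpass filtering, decimation, and first-stage peak picking are omitted.

```python
MIN_LAG, MAX_LAG = 24, 272   # allowable pitch period range in samples


def correlation(d, start, length, lag):
    """Correlation of the residual d with itself delayed by lag over one window."""
    return sum(d[n] * d[n - lag] for n in range(start, start + length))


def refine_pitch(d, coarse_lag, start, length, radius=4):
    """Search coarse_lag-radius .. coarse_lag+radius, clamped to the allowed range."""
    lo = max(MIN_LAG, coarse_lag - radius)
    hi = min(MAX_LAG, coarse_lag + radius)
    return max(range(lo, hi + 1),
               key=lambda lag: correlation(d, start, length, lag))


def encode_pitch(period):
    """Map a period in 24..272 to an index in 0..248 (assumed mapping)."""
    if not MIN_LAG <= period <= MAX_LAG:
        raise ValueError("pitch period out of range")
    return period - MIN_LAG
```

Since the largest index is 248, every value fits in the 8 bits allotted to IP.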
Only one such 8-bit pitch index is transmitted for each 20 ms frame.
Processor 70 determines the pitch period kpi for each subframe in the following way. If the difference between the extracted pitch period of the current frame and that of the last frame is greater than 20%, the extracted pitch period described above is used for every subframe in the current frame.
On the other hand, if this relative pitch change is less than 20%, then the extracted pitch period is used for the last 3 subframes of the current frame, while the pitch periods of the first 2 subframes are obtained by a linear interpolation between the extracted pitch period of the last frame and that of the current frame.
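The per-subframe pitch rule can be sketched as follows. The text does not say which period the 20% change is measured against, how a change of exactly 20% is handled, or where the interpolation endpoints lie; this sketch assumes the change is relative to the previous frame's period, treats exactly 20% as a small change, and interpolates in three equal steps so that the third subframe reaches the current period.

```python
def subframe_pitch_periods(prev_period, cur_period, threshold=0.20):
    """Return the five per-subframe pitch periods kpi for the current frame."""
    change = abs(cur_period - prev_period) / prev_period
    if change > threshold:
        # Large pitch jump: use the extracted period for every subframe.
        return [cur_period] * 5
    # Small change: interpolate the first two subframes, hold the last three.
    step = (cur_period - prev_period) / 3.0
    return [round(prev_period + step),
            round(prev_period + 2 * step),
            cur_period, cur_period, cur_period]
```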

b. Pitch Predictor Taps
The closed-loop pitch tap quantizer and pitch predictor 80 performs the following operations subframe-by-subframe: (1) closed-loop quantization of the 3 pitch taps, (2) generation of dh, the pitch-predicted version of the LPC prediction residual d in the current subframe, and (3) generation of hd, the closest match to the target signal tp.
Processor 80 has an internal buffer that stores previous samples of the signal dt, which can be regarded as the quantized version of the LPC prediction residual d. For each subframe, processor 80 uses the pitch period kpi to extract three 64-dimensional vectors from the dt buffer. These three vectors, which are called x1, x2, and x3, are respectively kpi - 1, kpi, and kpi + 1 samples earlier than the current frame of dt. These three vectors are then separately filtered by a shaping filter (with the coefficient array awc) which has zero initial filter memory. Let the resulting three 64-dimensional output vectors be called y1, y2, and y3. Next, processor 80 needs to search through a codebook of 64 candidate sets of 3 pitch predictor taps b_{1j}, b_{2j}, b_{3j}, j = 1, 2, ..., 64, and find the optimal set b_{1k}, b_{2k}, b_{3k} which minimizes the distortion measure
$$\left\| tp - \sum_{i=1}^{3} b_{ij}\, y_i \right\|^2 .$$
This type of problem has been studied before, and an efficient search method can be found in U.S. Patent No. 5,327,520. While the details of this technique will not be presented here, the basic idea is as follows.
It can be shown that minimizing this distortion measure is equivalent to maximizing an inner product of two 9-dimensional vectors. One of these 9-dimensional vectors contains only correlation coefficients of y1, y2, and y3.
The other 9-dimensional vector contains only the product terms derived from the set of three pitch predictor taps under evaluation. Since such a vector is signal-independent and depends only on the pitch tap codevector, there are only 64 such possible vectors (one for each pitch tap codevector), and they can be pre-computed and stored in a table, the VQ codebook. In an actual codebook search, the 9-dimensional correlation vector of y1, y2, and y3 is calculated first. Next, the inner product of the resulting vector with each of the 64 pre-computed and stored 9-dimensional vectors is calculated. The vector in the stored table which gives the maximum inner product is the winner, and the three quantized pitch predictor taps are derived from it. Since there are 64 vectors in the stored table, a 6-bit index, IT(m) for the m-th subframe, is sufficient to represent the three quantized pitch predictor taps. Since there are 5 subframes in each frame, a total of 30 bits per frame are used to represent the three pitch taps used for all subframes. These 30 bits are provided to the multiplexer 180 for transmission to the decoder as side information.
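The equivalence between minimizing the distortion and maximizing a 9-dimensional inner product can be illustrated by expanding the squared norm. The ordering of the nine terms below is one possible arrangement chosen for this sketch (the details are in the referenced patent, not in this text), and here the correlation vector includes the cross-correlations between tp and y1, y2, y3; all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def correlation_vector(tp, y1, y2, y3):
    """Signal-dependent 9-vector, computed once per subframe."""
    return [dot(tp, y1), dot(tp, y2), dot(tp, y3),
            dot(y1, y1), dot(y2, y2), dot(y3, y3),
            dot(y1, y2), dot(y2, y3), dot(y1, y3)]


def tap_vector(b1, b2, b3):
    """Signal-independent 9-vector; can be pre-computed per codevector."""
    return [2 * b1, 2 * b2, 2 * b3,
            -b1 * b1, -b2 * b2, -b3 * b3,
            -2 * b1 * b2, -2 * b2 * b3, -2 * b1 * b3]


def search_pitch_taps(tp, y1, y2, y3, codebook):
    """Return the index of the tap set maximizing the inner product."""
    table = [tap_vector(*taps) for taps in codebook]   # stored table
    corr = correlation_vector(tp, y1, y2, y3)
    return max(range(len(table)), key=lambda j: dot(corr, table[j]))


def distortion(tp, ys, taps):
    """Direct evaluation of ||tp - sum_i b_i y_i||^2, for checking only."""
    pred = [sum(b * y[n] for b, y in zip(taps, ys)) for n in range(len(tp))]
    return sum((t - p) ** 2 for t, p in zip(tp, pred))
```

The inner product equals ||tp||^2 minus the distortion, so the maximizing index is also the minimum-distortion index; with 64 codevectors the index fits in 6 bits.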
For each subframe, after the optimal set of 3 pitch taps b_{1k}, b_{2k}, b_{3k} is selected by the codebook search method outlined above, the pitch-predicted version of d is calculated as
$$dh = \sum_{i=1}^{3} b_{ik}\, x_i .$$
The output signal vector hd is calculated as
$$hd = \sum_{i=1}^{3} b_{ik}\, y_i .$$
This vector hd is subtracted from the vector tp by the subtracting unit 90. The result is tt, the target vector for transform coding.
4. Transform Coding of the Target Vector
a. Shaping Filter Magnitude Response for Normalization
The target vector tt is encoded subframe-by-subframe by blocks 100 through 150 using a transform coding approach. The shaping filter magnitude response processor 100 calculates the signal mag in the following way. First, it takes the shaping filter coefficient array awc of the last subframe of the current frame, zero-pads it to 64 samples, and then performs a 64-point FFT on the resulting 64-dimensional vector. Then, it calculates the magnitudes of the 33 FFT coefficients which correspond to the frequency range of 0 to 8 kHz. The result vector mag is the magnitude response of the shaping filter for the last subframe. To save computation, the mag vectors for the first four subframes are obtained by a linear interpolation between the mag vector of the last subframe of the last frame and that of the last subframe of the current frame.
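A minimal sketch of the mag computation follows the steps literally: zero-pad, transform, take 33 magnitudes, then interpolate. A direct DFT is used in place of an FFT to keep the example self-contained, and the equal-step interpolation weights are an assumption, since the text only says the interpolation is linear.

```python
import cmath
import math


def dft_magnitudes(x, n_fft=64, n_bins=33):
    """Magnitudes of the first n_bins DFT coefficients of x, zero-padded to n_fft."""
    padded = list(x) + [0.0] * (n_fft - len(x))
    mags = []
    for k in range(n_bins):
        acc = sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, v in enumerate(padded))
        mags.append(abs(acc))
    return mags


def interpolate_mag(prev_last, cur_last, n_subframes=5):
    """mag vectors for subframes 1..5; the last one equals cur_last."""
    return [[p + (c - p) * m / n_subframes for p, c in zip(prev_last, cur_last)]
            for m in range(1, n_subframes + 1)]
```

Only one 64-point transform per frame is needed this way, which is the computational saving the text refers to.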
b. Transform and Gain Normalization
The transform processor 110 performs several operations, as described below. It first transforms the 64-dimensional vector tt in the current subframe by using a 64-point FFT. This transform size of 64 samples (or 4 ms) avoids the so-called "pre-echo" distortion well-known in the audio coding art. See Jayant, N. et al., "Signal Compression Based on Models of Human Perception," Proc. IEEE, pp. 1385-1422, October 1993, which is incorporated by reference as if set forth fully herein. Each of the first 33 complex FFT coefficients is then divided by the corresponding element in the mag vector.
The resulting normalized FFT coefficient vector is partitioned into 3 frequency bands: (1) the low-frequency band consisting of the first 6 normalized FFT coefficients (i.e., from 0 to 1250 Hz), (2) the mid-frequency band consisting of the next 10 normalized FFT coefficients (from 1500 to 3750 Hz), and (3) the high-frequency band consisting of the remaining 17 normalized FFT coefficients (from 4000 to 8000 Hz).
The total energy in each of the 3 bands is calculated and then converted to a dB value, called the log gain of each band. The log gain of the low-frequency band is quantized using a 5-bit scalar quantizer designed using the Lloyd algorithm well known in the art. The quantized low-frequency log gain is subtracted from the log gains of the mid- and high-frequency bands.
The resulting level-adjusted mid- and high-frequency log gains are concatenated to form a 2-dimensional vector, which is then quantized by a 7-bit vector quantizer, with a codebook designed by the generalized Lloyd algorithm, again well-known in the art. The quantized low-frequency log gain is then added back to the quantized versions of the level-adjusted mid- and high-frequency log gains to obtain the quantized log gains of the mid- and high-frequency bands. Next, all three quantized log gains are converted from the logarithmic (dB) domain back to the linear domain. Each of the 33 normalized FFT coefficients (normalized by mag as described above) is then further divided by the corresponding quantized linear gain of the frequency band in which the FFT coefficient lies. After this second stage of normalization, the result is the final normalized transform coefficient vector tc, which contains 33 complex numbers representing frequencies from 0 to 8000 Hz.
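The band partition and the level adjustment can be sketched as below. The quantizer codebooks themselves are not given in the text, so they are left out; the use of 10*log10 of the band energy as the dB value is an assumption, and the function names are hypothetical.

```python
import math

# Index ranges of the three bands among the 33 FFT coefficients (250 Hz spacing):
# low 0-1250 Hz, mid 1500-3750 Hz, high 4000-8000 Hz.
BANDS = [(0, 6), (6, 16), (16, 33)]


def band_log_gains(coeffs):
    """Log gain (dB) of each band from 33 complex normalized FFT coefficients."""
    gains = []
    for lo, hi in BANDS:
        energy = sum(abs(c) ** 2 for c in coeffs[lo:hi])
        gains.append(10.0 * math.log10(max(energy, 1e-12)))  # guard against log(0)
    return gains


def level_adjust(log_gains, quantized_low):
    """Mid and high log gains relative to the quantized low-band log gain."""
    return [log_gains[1] - quantized_low, log_gains[2] - quantized_low]
```

The two level-adjusted values form the 2-dimensional vector that the 7-bit vector quantizer operates on.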
During the quantization of log gains in the m-th subframe, the transform processor 110 produces a 5-bit gain codebook index IG(m,1) for the low-frequency log gain and a 7-bit gain codebook index IG(m,2) for the mid- and high-frequency log gains. Therefore, the 3 log gains are encoded at a bit-rate of 12 bits per subframe, or 60 bits per frame. These 60 bits are provided to the multiplexer 180 for transmission to the decoder as side information. These 60 gain bits, along with the 49 bits for LSP, 8 bits for the pitch period, and 30 bits for the pitch taps, form the side information, which totals 49 + 8 + 30 + 60 = 147 bits per frame.

c. The Bit Stream
As described above, 49 bits/frame have been allocated for encoding LPC parameters, 8+(6x5)=38 bits/frame have been allocated for the 3-tap pitch predictor, and (5+7)x5=60 bits/frame for the gains. Therefore, the total number of side information bits is 49+38+60=147 bits per 20 ms frame, or roughly 30 bits per 4 ms subframe. Consider that the coder might be used at one of three different rates: 16, 24 and 32 kb/s. At a sampling rate of 16 kHz, these three target rates translate to 1, 1.5, and 2 bits/sample, or 64, 96, and 128 bits/subframe, respectively. With 30 bits/subframe used for side information, the numbers of bits remaining to use in encoding the main information (encoding of FFT coefficients) are 34, 66, and 98 bits/subframe for the three rates of 16, 24, and 32 kb/s, respectively.
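The bit budget arithmetic above can be checked with a few lines. The constant and function names are illustrative only; rounding 147/5 up to 30 follows the text's "roughly 30 bits" figure.

```python
import math

SAMPLE_RATE_HZ = 16000
FRAME_MS = 20
SUBFRAMES = 5

# Side information per 20 ms frame, as listed in the text.
SIDE_BITS = {
    "lsp": 49,
    "pitch_period": 8,
    "pitch_taps": 6 * SUBFRAMES,
    "gains": (5 + 7) * SUBFRAMES,
}


def main_bits_per_subframe(rate_bps):
    """Bits left for FFT coefficients in each 4 ms subframe."""
    samples_per_subframe = SAMPLE_RATE_HZ * FRAME_MS // 1000 // SUBFRAMES  # 64
    total = rate_bps * samples_per_subframe // SAMPLE_RATE_HZ
    side = math.ceil(sum(SIDE_BITS.values()) / SUBFRAMES)  # 147 / 5, rounded up
    return total - side
```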

d. Adaptive Bit Allocation
In accordance with the principles of the present invention, adaptive bit allocation is performed to assign these remaining bits to various parts of the frequency spectrum with different quantization accuracy, in order to enhance the perceptual quality of the output speech at the TPC decoder. This is done by using a model of human sensitivity to noise in audio signals. Such models are known in the art of perceptual audio coding. See, e.g., Tobias, J. V., ed., Foundations of Modern Auditory Theory, Academic Press, New York and London, 1970. See also Schroeder, M. R. et al., "Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear," J. Acoust. Soc. Amer., 66:1647-1652, December 1979 (Schroeder et al.), which is hereby incorporated by reference as if fully set forth herein.
Hearing model and quantizer control processor 130 performs adaptive bit allocation and generates an output vector ba which tells the transform coefficient quantizer 120 how many bits should be used to quantize each of the 33 normalized transform coefficients contained in tc. While adaptive bit allocation might be performed once every subframe, the illustrative embodiment of the present invention performs bit allocation once per frame in order to reduce computational complexity.
Rather than using the unquantized input signal to derive the noise masking threshold and bit allocation, as is done in conventional music coders, the noise masking threshold and bit allocation of the illustrative embodiment are determined from the frequency response of the quantized LPC synthesis filter (which is often referred to as the "LPC spectrum"). The LPC spectrum can be considered an approximation of the spectral envelope of the input signal within the 24 ms LPC analysis window. The LPC spectrum is determined based on the quantized LPC coefficients. The quantized LPC coefficients are provided by the LPC parameter processor 10 to the hearing model and quantizer control processor 130, which determines the LPC spectrum as follows. The quantized LPC filter coefficients a are first transformed by a 64-point FFT. The power of each of the first 33 FFT coefficients is determined and the reciprocal is then calculated. The result is the LPC power spectrum which has the frequency resolution of a 64-point FFT.
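The LPC power spectrum computation can be sketched as follows, again with a direct DFT standing in for the FFT. The function name and the small floor that guards against division by zero are additions for this example, not part of the text.

```python
import cmath
import math


def lpc_power_spectrum(a, n_fft=64, n_bins=33):
    """Reciprocal of |A(k)|^2 for the first n_bins bins of an n_fft-point DFT.

    a: quantized LPC coefficient array [1, a_1, ..., a_16] (assumed layout).
    """
    spectrum = []
    for k in range(n_bins):
        acc = sum(c * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, c in enumerate(a))
        spectrum.append(1.0 / max(abs(acc) ** 2, 1e-12))
    return spectrum
```

Because the decoder has the same quantized coefficients a, it can compute an identical spectrum without any extra transmitted bits.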
After the LPC power spectrum is determined, an estimated noise masking threshold, TM, is calculated using a modified version of the method described in U.S. Patent No. 5,314,457, which is incorporated by reference as if fully set forth herein. Processor 130 scales the 33 samples of LPC power spectrum by a frequency-dependent attenuation function empirically determined from subjective listening experiments. The attenuation function starts at 12 dB for the DC term of the LPC power spectrum, increases to about 15 dB between 700 and 800 Hz, then decreases monotonically toward high frequencies, and finally reduces to 6 dB at 8000 Hz.
Each of the 33 attenuated LPC power spectrum samples is then used to scale a "basilar membrane spreading function" derived for that particular frequency to calculate the masking threshold. A spreading function for a given frequency corresponds to the shape of the masking threshold in response to a single-tone masker signal at that frequency. Equation (5) of Schroeder et al. describes such spreading functions in terms of the "bark" frequency scale, or critical-band frequency scale, and is incorporated by reference as if set forth fully herein. The scaling process begins with the first 33 frequencies of a 64-point FFT (i.e., 0 Hz, 250 Hz, 500 Hz, . . ., 8000 Hz) being converted to the "bark" frequency scale. Then, for each of the 33 resulting bark values, the corresponding spreading function is sampled at these 33 bark values using equation (5) of Schroeder et al. The 33 resulting spreading functions are stored in a table, which may be done as part of an off-line process. To calculate the estimated masking threshold, each of the 33 spreading functions is multiplied by the corresponding sample value of the attenuated LPC power spectrum, and the resulting 33 scaled spreading functions are summed together. The result is the estimated masking threshold function. It should be noted that this technique for estimating the masking threshold is not the only technique available.
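The table-based threshold estimate can be sketched as below. The hertz-to-bark mapping and the spreading function are written in the forms commonly attributed to Schroeder et al.; they are stated here from memory as assumptions and should be checked against the paper. The empirically determined attenuation function is not reproduced in the text, so the sketch takes the already attenuated spectrum as input.

```python
import math


def hz_to_bark(f_hz):
    """Assumed mapping: f = 650 * sinh(x / 7), solved for the bark value x."""
    return 7.0 * math.asinh(f_hz / 650.0)


def spreading_db(x):
    """Assumed spreading function in dB at bark distance x from the masker."""
    return 15.81 + 7.5 * (x + 0.474) - 17.5 * math.sqrt(1.0 + (x + 0.474) ** 2)


FREQS = [250.0 * k for k in range(33)]           # 0, 250, ..., 8000 Hz
BARKS = [hz_to_bark(f) for f in FREQS]
# SPREAD[j][i]: linear-domain spreading of a masker at bin j, sampled at bin i.
# This table is signal-independent and could be built off-line.
SPREAD = [[10.0 ** (spreading_db(bi - bj) / 10.0) for bi in BARKS]
          for bj in BARKS]


def masking_threshold(attenuated_power):
    """Sum of the 33 spreading functions, each scaled by its spectrum sample."""
    return [sum(attenuated_power[j] * SPREAD[j][i] for j in range(33))
            for i in range(33)]
```

For a single masker the threshold peaks at the masker's own frequency and falls off on both sides, which is the behavior a spreading function is meant to model.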
To keep the complexity low, processor 130 uses a "greedy" algorithm to perform adaptive bit allocation. The technique is "greedy" in the sense that it allocates one bit at a time to the most "needy" frequency component without regard to its potential influence on future bit allocation. At the beginning, when no bit is assigned yet, the corresponding output speech will be zero, and the coding error signal is the input speech itself. Therefore, initially the LPC power spectrum is assumed to be the power spectrum of the coding noise.
Then, the noise loudness at each of the 33 frequencies of a 64-point FFT is estimated using the masking threshold calculated above and a simplified version of the noise loudness calculation method in Schroeder et al.
The simplified noise loudness at each of the 33 frequencies is calculated as follows. First, the critical bandwidth B_i at the i-th frequency is calculated using linear interpolation of the critical bandwidth listed in table 1 of Scharf's book chapter in Tobias. The result is the approximated value of the term df/dx in equation (3) of Schroeder et al. The 33 critical bandwidth values are pre-computed and stored in a table. Then, for the i-th frequency, the noise power N_i is compared with the masking threshold M_i. If N_i ≤ M_i, the noise loudness L_i is set to zero. If N_i > M_i, then the noise loudness is calculated as
$$L_i = B_i \left( \frac{N_i - M_i}{1 + (S_i / N_i)^2} \right)^{0.25}$$
where S_i is the sample value of the LPC power spectrum at the i-th frequency.

Once the noise loudness is calculated for all 33 frequencies, the frequency with the maximum noise loudness is identified and one bit is assigned to this frequency. The noise power at this frequency is then reduced by a factor which is empirically determined from the signal-to-noise ratio (SNR) obtained during the design of the VQ codebook for quantizing the normalized FFT coefficients. (Illustrative values for the reduction factor are between 4 and 5 dB.) The noise loudness at this frequency is then updated using the reduced noise power. Next, the maximum is again identified from the updated noise loudness array, and one bit is assigned to the corresponding frequency. This process continues until all available bits are exhausted.
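The greedy procedure can be sketched as follows. The 4.5 dB reduction per bit is just a value picked from the illustrative 4 to 5 dB range, the per-coefficient bit limits described elsewhere are ignored, ties are broken by taking the lowest frequency index, and the function names are hypothetical.

```python
def noise_loudness(n, m, s, b):
    """Simplified loudness of noise power n given threshold m, LPC power s, bandwidth b."""
    if n <= m:
        return 0.0
    return b * ((n - m) / (1.0 + (s / n) ** 2)) ** 0.25


def greedy_bit_allocation(lpc_power, mask, bandwidths, total_bits, reduction_db=4.5):
    """Assign bits one at a time to the frequency with the largest noise loudness."""
    noise = list(lpc_power)                 # initially the noise is the signal itself
    bits = [0] * len(noise)
    factor = 10.0 ** (-reduction_db / 10.0)
    loud = [noise_loudness(n, m, s, b)
            for n, m, s, b in zip(noise, mask, lpc_power, bandwidths)]
    for _ in range(total_bits):
        i = max(range(len(loud)), key=loud.__getitem__)
        bits[i] += 1
        noise[i] *= factor                  # one more bit lowers the noise power
        loud[i] = noise_loudness(noise[i], mask[i], lpc_power[i], bandwidths[i])
    return bits
```

For example, `greedy_bit_allocation([100.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], 5)` gives all five bits to the first frequency, the only one whose noise exceeds its masking threshold.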

For the 32 and 24 kb/s TPC coder, each of the 33 frequencies can receive bits during adaptive bit allocation. For the 16 kb/s TPC coder, on the other hand, better speech quality can be achieved if the coder assigns bits only to the frequency range of 0 to 4 kHz (i.e., the first 16 FFT coefficients) and synthesizes the residual FFT coefficients in the higher frequency band of 4 to 8 kHz using the high-frequency synthesis processor 140. Note that since the quantized LPC coefficients a are also available at the TPC decoder, there is no need to transmit the bit allocation information.
This bit allocation information is determined by a replica, in the decoder, of the hearing model quantizer control processor 130. Thus, the TPC decoder can locally duplicate the encoder's adaptive bit allocation operation to obtain such bit allocation information.

e. Quantization of Transform Coefficients
The transform coefficient quantizer 120 quantizes the transform coefficients contained in tc using the bit allocation signal ba. The DC term of the FFT is a real number, and it is scalar quantized if it ever receives any bit during bit allocation. The maximum number of bits it can receive is 4. For the second through the 16th FFT coefficients, a conventional two-dimensional vector quantizer is used to quantize the real and imaginary parts jointly. The maximum number of bits for this 2-dimensional VQ is 6 bits. For the remaining FFT coefficients, a conventional 4-dimensional vector quantizer is used to jointly quantize the real and imaginary parts of two adjacent FFT coefficients. After the quantization of transform coefficients is done, the resulting VQ codebook index array IC contains the main information of the TPC encoder. This index array IC is provided to the multiplexer 180, where it is combined with side information bits. The result is the final bit-stream, which is transmitted through a communication channel to the TPC decoder.
The transform coefficient quantizer 120 also decodes the quantized values of the normalized transform coefficients. It then restores the original gain levels of these transform coefficients by multiplying each of these coefficients by the corresponding elements of mag and the quantized linear gain of the corresponding frequency band. The result is the output vector dtc.

f. High-Frequency Synthesis and Noise Fill-In
For the 16 kb/s coder, adaptive bit allocation is restricted to the 0 to 4 kHz band, and processor 140 synthesizes the 4 to 8 kHz band. Before doing so, the hearing model quantizer control processor 130 first calculates the ratio between the LPC power spectrum and the masking threshold, or the signal-to-masking-threshold ratio (SMR), for the frequencies in the 4 to 7 kHz band.
The 17th through the 29th FFT coefficients (4 to 7 kHz) are synthesized using phases which are random and magnitude values that are controlled by the SMR. For those frequencies with SMR > 5 dB, the magnitude of the FFT coefficients is set to the quantized linear gain of the high-frequency band. For those frequencies with SMR ≤ 5 dB, the magnitude is 2 dB below the quantized linear gain of the high-frequency band. From the 30th through the 33rd FFT coefficients, the magnitude ramps down from 2 dB to 30 dB below the quantized linear gain of the high-frequency band, and the phase is again random.
For 32 and 24 kb/s coders, bit allocation is performed for the entire frequency band as described. However, some frequencies in the 4 to 8 kHz band may still receive no bits. In this case, the high-frequency synthesis and noise fill-in procedure described above is applied only to those frequencies receiving no bits.
After applying such high-frequency synthesis and noise fill-in to the vector dtc, the resulting output vector qtc contains the quantized version of the transform coefficients before normalization.
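The synthesis rule for the 16 kb/s case can be sketched as below. Two points are assumptions: the ramp from 2 dB to 30 dB is taken to be linear in dB over the four coefficients, and the dB offsets are applied as amplitude ratios (division by 20). The function name and the dictionary return type are hypothetical.

```python
import cmath
import math
import random


def synthesize_high_band(smr_db, hf_gain, rng):
    """Synthesize FFT bins 16..32 (the 17th through 33rd coefficients).

    smr_db  : 13 SMR values in dB for bins 16..28 (4 to 7 kHz)
    hf_gain : quantized linear gain of the high-frequency band
    rng     : random.Random instance supplying the random phases
    """
    out = {}
    for offset, smr in enumerate(smr_db):
        drop_db = 0.0 if smr > 5.0 else 2.0
        mag = hf_gain * 10.0 ** (-drop_db / 20.0)
        out[16 + offset] = cmath.rect(mag, rng.uniform(0.0, 2.0 * math.pi))
    for step, k in enumerate(range(29, 33)):
        drop_db = 2.0 + (30.0 - 2.0) * step / 3.0     # 2 dB down to 30 dB below
        mag = hf_gain * 10.0 ** (-drop_db / 20.0)
        out[k] = cmath.rect(mag, rng.uniform(0.0, 2.0 * math.pi))
    return out
```

For the 24 and 32 kb/s coders the same values would be used only at the bins that received no bits.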

g. Inverse Transform and Filter Memory Updates
The inverse transform processor 150 performs the inverse FFT on the 64-element complex vector represented by the half-size 33-element vector qtc. This results in an output vector qtt, which is the quantized version of tt, the time-domain target vector for transform coding.
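The expansion of the 33-element half spectrum into 64 elements relies on the conjugate symmetry of the transform of a real signal. A minimal sketch, using a direct inverse DFT and a hypothetical function name, is:

```python
import cmath
import math


def inverse_fft_from_half(qtc, n_fft=64):
    """Real inverse DFT from bins 0..n_fft/2 of a conjugate-symmetric spectrum."""
    # Bins 33..63 mirror bins 31..1 with conjugation.
    full = list(qtc) + [complex(qtc[n_fft - k]).conjugate()
                        for k in range(n_fft // 2 + 1, n_fft)]
    return [sum(X * cmath.exp(2j * math.pi * k * n / n_fft)
                for k, X in enumerate(full)).real / n_fft
            for n in range(n_fft)]
```

Taking the real part discards any small imaginary residue; for the output to be exactly real, the DC and 8 kHz bins of qtc themselves need to be real.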
With zero initial filter states (filter memory), the inverse shaping filter 160, which is an all-zero filter having awc as its coefficient array, filters the vector qtt to produce an output vector et. The adder 170 then adds dh to et to obtain the quantized LPC prediction residual dt. This dt vector is then used to update the internal storage buffer in the closed-loop pitch tap quantizer and pitch predictor 80. It is also used to excite the internal shaping filter inside the zero-input response processor 50 in order to establish the correct filter memory in preparation for the zero-input response generation for the next subframe.

C. An Illustrative Decoder Embodiment
An illustrative decoder embodiment of the present invention is shown in Figure 2. For each frame, the demultiplexer 200 separates all main and side information components from the received bit-stream. The main information, the transform coefficient index array IC, is provided to the transform coefficient decoder 235. In order to decode this main information, adaptive bit allocation must be performed to determine how many of the main information bits are associated with each quantized transform coefficient.
The first step in adaptive bit allocation is the generation of quantized LPC coefficients (upon which allocation depends). The demultiplexer 200 provides the seven LSP codebook indices IL(1) to IL(7) to the LPC parameter decoder 215, which performs table look-up from the 7 LSP VQ codebooks to obtain the 16 quantized LSP coefficients. The LPC parameter decoder 215 then performs the same sorting, interpolation, and LSP-to-LPC coefficient conversion operations as in blocks 345, 350, and 355 in Figure 3.
With LPC coefficient array a calculated, the hearing model quantizer control processor 220 determines the bit allocation (based on the quantized LPC parameters) for each FFT coefficient in the same way as processor 130 in the TPC encoder (Figure 1). Similarly, the shaping filter coefficient processor 225 and the shaping filter magnitude response processor 230 are also replicas of the corresponding processors 30 and 100, respectively, in the TPC encoder. Processor 230 produces mag, the magnitude response of the shaping filter, for use by the transform coefficient decoder 235.
Once the bit allocation information is derived, the transform coefficient decoder 235 can then correctly decode the main information and obtain the quantized versions of the normalized transform coefficients. The decoder 235 also decodes the gains using the gain index array IG. For each subframe, there are two gain indices (5 and 7 bits), which are decoded into the quantized log gain of the low-frequency band and the quantized versions of the level-adjusted log gains of the mid- and high-frequency bands. The quantized low-frequency log gain is then added back to the quantized versions of the level-adjusted mid- and high-frequency log gains to obtain the quantized log gains of the mid- and high-frequency bands. All three quantized log gains are then converted from the logarithmic (dB) domain back to the linear domain. Each of the three quantized linear gains is used to multiply the quantized versions of the normalized transform coefficients in the corresponding frequency band. Each of the resulting 33 gain-scaled, quantized transform coefficients is then further multiplied by the corresponding element in the shaping filter magnitude response array mag.
After these two stages of scaling, the result is the decoded transform coefficient array dtc.
The high-frequency synthesis processor 240, inverse transform processor 245, and the inverse shaping filter 250 are again exact replicas of the corresponding blocks (140, 150, and 160) in the TPC encoder. Together they perform high-frequency synthesis, noise fill-in, inverse transformation, and inverse shaping filtering to produce the quantized excitation vector et.
The pitch decoder and interpolator 205 decodes the 8-bit pitch index IP to get the pitch period for the last 3 subframes, and then interpolates the pitch period for the first two subframes in the same way as is done in the corresponding block 70 of the TPC encoder. The pitch tap decoder and pitch predictor 210 decodes the pitch tap index IT for each subframe to get the three quantized pitch predictor taps b_{1k}, b_{2k}, and b_{3k}. It then uses the interpolated pitch period kpi to extract the same three vectors x1, x2, and x3 as described in the encoder section. (These three vectors are respectively kpi - 1, kpi, and kpi + 1 samples earlier than the current frame of dt.) Next, it computes the pitch-predicted version of the LPC residual as
$$dh = \sum_{i=1}^{3} b_{ik}\, x_i .$$

The adder 255 adds dh and et to get dt, the quantized version of the LPC prediction residual d. This dt vector is fed back to the pitch predictor inside block 210 to update its internal storage buffer for dt (the filter memory of the pitch predictor).
The long-term postfilter 260 is basically similar to the long-term postfilter used in the ITU-T G.728 standard 16 kb/s Low-Delay CELP coder.
The main difference is that it uses $\sum_{i=1}^{3} b_{ik}$, the sum of the three quantized pitch taps, as the voicing indicator, and that the scaling factor for the long-term postfilter coefficient is 0.4 rather than 0.15 as in G.728. If this voicing indicator is less than 0.5, the postfiltering operation is skipped, and the output vector fdt is identical to the input vector dt. If this indicator is 0.5 or more, the postfiltering operation is carried out.
The LPC synthesis filter 265 is the standard LPC filter, an all-pole, direct-form filter with the quantized LPC coefficient array a. It filters the signal fdt and produces the long-term postfiltered, quantized speech vector st. This st vector is passed through the short-term postfilter 270 to produce the final TPC decoder output speech signal fst. Again, this short-term postfilter 270 is very similar to the short-term postfilter used in G.728. The only differences are the following. First, the pole-controlling factor, the zero-controlling factor, and the spectral-tilt controlling factor are 0.7, 0.55, and 0.4, respectively, rather than the corresponding values of 0.75, 0.65, and 0.15 in G.728.
Second, the coefficient of the first-order spectral-tilt compensation filter is linearly interpolated sample-by-sample between frames. This helps to avoid occasionally audible clicks due to discontinuity at frame boundaries.
The long-term and short-term postfilters have the effect of reducing the perceived level of coding noise in the output signal fst, thus enhancing the speech quality.

Claims

What is claimed is:
1. A method of coding a frame of a speech signal comprising the steps of:
removing short-term correlations from the speech signal with use of a linear prediction filter to produce a prediction residual signal;
determining an open-loop estimate of the pitch period of the speech signal based on the prediction residual signal;
determining pitch filter tap weights for two or more subframes of the frame based on a quantized version of the prediction residual signal;
forming a pitch prediction residual signal based on the open loop pitch period estimate, the pitch filter tap weights for the two or more subframes, and the prediction residual signal; and
quantizing the pitch prediction residual signal.
CA 2219358 1996-02-26 1997-02-26 Speech signal quantization using human auditory models in predictive coding systems Abandoned CA2219358A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US1229696P 1996-02-26 1996-02-26
US60/012,296 1996-02-26

Publications (1)

Publication Number Publication Date
CA2219358A1 true CA2219358A1 (en) 1997-08-28

Family

ID=21754300

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2219358 Abandoned CA2219358A1 (en) 1996-02-26 1997-02-26 Speech signal quantization using human auditory models in predictive coding systems

Country Status (5)

Country Link
EP (1) EP0954851A1 (en)
JP (1) JPH11504733A (en)
CA (1) CA2219358A1 (en)
MX (1) MX9708203A (en)
WO (1) WO1997031367A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6397178B1 (en) * 1998-09-18 2002-05-28 Conexant Systems, Inc. Data organizational scheme for enhanced selection of gain parameters for speech coding
US6778953B1 (en) * 2000-06-02 2004-08-17 Agere Systems Inc. Method and apparatus for representing masked thresholds in a perceptual audio coder
DE60209888T2 (en) 2001-05-08 2006-11-23 Koninklijke Philips Electronics N.V. CODING AN AUDIO SIGNAL
US7451091B2 (en) 2003-10-07 2008-11-11 Matsushita Electric Industrial Co., Ltd. Method for determining time borders and frequency resolutions for spectral envelope coding
DE102006022346B4 (en) * 2006-05-12 2008-02-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Information signal coding
CA3093517C (en) 2010-07-02 2021-08-24 Dolby International Ab Audio decoding with selective post filtering
WO2012161675A1 (en) * 2011-05-20 2012-11-29 Google Inc. Redundant coding unit for audio codec
CN103999153B (en) * 2011-10-24 2017-03-01 Lg电子株式会社 Method and apparatus for quantifying voice signal in the way of with selection
CN111862995A (en) * 2020-06-22 2020-10-30 北京达佳互联信息技术有限公司 Code rate determination model training method, code rate determination method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5012517A (en) * 1989-04-18 1991-04-30 Pacific Communication Science, Inc. Adaptive transform coder having long term predictor
FR2700632B1 (en) * 1993-01-21 1995-03-24 France Telecom Predictive coding-decoding system for a digital speech signal by adaptive transform with nested codes.

Also Published As

Publication number Publication date
MX9708203A (en) 1997-12-31
JPH11504733A (en) 1999-04-27
WO1997031367A1 (en) 1997-08-28
EP0954851A1 (en) 1999-11-10
EP0954851A4 (en) 1999-11-10

Similar Documents

Publication Publication Date Title
CA2185746C (en) Perceptual noise masking measure based on synthesis filter frequency response
CA2185731C (en) Speech signal quantization using human auditory models in predictive coding systems
US6014621A (en) Synthesis of speech signals in the absence of coded parameters
Gersho Advances in speech and audio compression
RU2262748C2 (en) Multi-mode encoding device
US6574593B1 (en) Codebook tables for encoding and decoding
US6604070B1 (en) System of encoding and decoding speech signals
Paliwal et al. Vector quantization of LPC parameters in the presence of channel errors
EP0503684B1 (en) Adaptive filtering method for speech and audio
US6961698B1 (en) Multi-mode bitstream transmission protocol of encoded voice signals with embeded characteristics
JP3490685B2 (en) Method and apparatus for adaptive band pitch search in wideband signal coding
US6098036A (en) Speech coding system and method including spectral formant enhancer
CA2140329C (en) Decomposition in noise and periodic signal waveforms in waveform interpolation
KR100304092B1 (en) Audio signal coding apparatus, audio signal decoding apparatus, and audio signal coding and decoding apparatus
US5699382A (en) Method for noise weighting filtering
US6119082A (en) Speech coding system and method including harmonic generator having an adaptive phase off-setter
US6081776A (en) Speech coding system and method including adaptive finite impulse response filter
JP4176349B2 (en) Multi-mode speech encoder
US6138092A (en) CELP speech synthesizer with epoch-adaptive harmonic generator for pitch harmonics below voicing cutoff frequency
MXPA96004161A (en) Quantization of speech signals using human auditory models in predictive coding systems
KR20030046451A (en) Codebook structure and search for speech coding
Ordentlich et al. Low-delay code-excited linear-predictive coding of wideband speech at 32 kbps
CA2219358A1 (en) Speech signal quantization using human auditory models in predictive coding systems
JPH01261930A (en) Sound encoding/decoding system
CA2303711C (en) Method for noise weighting filtering

Legal Events

Date Code Title Description
EEER Examination request
FZDE Dead