GB2436192A - A speech encoded signal and a long term predictor (ltp) logic comprising ltp memory and which quantises a memory state of the ltp logic. - Google Patents


Info

Publication number
GB2436192A
Authority
GB
United Kingdom
Prior art keywords
speech
ltp
logic
communication unit
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0605041A
Other versions
GB0605041D0 (en)
GB2436192B (en)
Inventor
Jonathan Alastair Gibbs
Halil Fikretler
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Priority to GB0605041A priority Critical patent/GB2436192B/en
Publication of GB0605041D0 publication Critical patent/GB0605041D0/en
Priority to PCT/US2007/062277 priority patent/WO2007106638A2/en
Publication of GB2436192A publication Critical patent/GB2436192A/en
Application granted granted Critical
Publication of GB2436192B publication Critical patent/GB2436192B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08: Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/09: Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08: Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04B: TRANSMISSION
    • H04B1/00: Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
    • H04B1/66: Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission for reducing bandwidth of signals; for improving efficiency of transmission
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/16: Analogue secrecy systems; Analogue subscription systems
    • H04N7/167: Systems rendering the television signal unintelligible and subsequently intelligible

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Analogue/Digital Conversion (AREA)

Abstract

A speech communication unit (100) comprises a speech encoder (134) capable of representing an input speech signal. The speech encoder (134) comprises long-term prediction (LTP) logic having memory (215) operably coupled to quantization logic, wherein the quantization logic is arranged to quantize a memory state of the LTP logic. On the decoder side, a first speech decoder (260) receives the speech encoded bitstream and has a conventional long term predictor (LTP) memory element (265) that is driven by one or more previous codebook decisions made by the speech encoder. A second decoder (275) also receives the speech encoded bitstream and has an LTP memory element (280) that is updated by quantized values of a memory state of an LTP logic of the speech encoder.

Description

<p>SPEECH COMMUNICATION UNIT INTEGRATED CIRCUIT AND METHOD</p>
<p>THEREFOR</p>
<p>Field of the Invention</p>
<p>Embodiments of the present invention relate to speech coding and methods for improving the performance of speech codecs in speech communication units. The invention is applicable to, but not limited to, improving performance over error-prone channels.</p>
<p>Background of the Invention</p>
<p>Many present day voice communications systems, such as the global system for mobile communications (GSM) cellular telephony standard and the third generation cellular or Universal Mobile Telecommunications System (UMTS), use speech-processing units to encode and decode speech patterns. In such voice communications systems a speech encoder in a transmitting unit converts an analogue speech pattern into a suitable digital format for transmission. A speech decoder in a receiving unit converts a received digital speech signal into an audible analogue speech pattern.</p>
<p>As frequency spectrum for such wireless voice communication systems is a valuable resource, it is desirable to limit the channel bandwidth used by such speech signals, in order to maximise the number of users per frequency band. Hence, a primary objective in the use of speech coding techniques is to reduce the occupied capacity of the speech patterns as much as possible, by use of compression techniques, without losing fidelity.</p>
<p>In the field of Code Excited Linear Predictive (CELP) speech coders, speech coding techniques are adopted to provide high quality narrowband (300-3300 Hz) and wideband (50-7000 Hz) speech compression at low to medium bit rates (5-24 kb/s) for speech/audio communication units.</p>
<p>The high quality of these codecs can generally be attributed to the use of a long-term (or pitch) predictor (LTP), which allows the speech excitation over successive pitch periods to approach, in a perceptual sense, that of the source speech. However, the coding gain associated with LTP codecs comes at a cost.</p>
<p>The memory associated with maintaining the LTP state from frame to frame means that if either bit-errors or frame-errors are introduced into the transmission channel, the encoder and decoder states may begin to diverge. Such divergences usually persist until the end of a voiced utterance, or vowel, and may lead to synthesised speech that sounds very different from the source speech.</p>
<p>In the ITU-T, work has started on an embedded speech coder that should provide high quality wideband speech from an 8kb/sec. core and, in a series of four extra layers (4kb/sec., 4kb/sec., 8kb/sec. and 8kb/sec.), provide both quality improvements and error resilience improvements on this basic core codec. The quality requirements for the 8kb/sec. core codec are high, and mean that the coding gain of a conventional CELP codec with a conventional LTP is required.</p>
<p>Potentially, coding the state of the LTP memory may turn the coder into a memory-less system and overcome the error propagation effects. However, achieving this with finite quantization resources is problematic since, if performed badly, it may consume many resources and yet still introduce significant errors into the synthesised speech following an error event.</p>
<p>Thus, a need exists for an improved speech communication unit, and an integrated circuit and method of operation therefor.</p>
<p>Summary of the Invention</p>
<p>In accordance with aspects of the present invention, there is provided a speech communication unit, an integrated circuit and method of operation therefor, as defined in the appended Claims.</p>
<p>Brief Description of the Drawings</p>
<p>Exemplary embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:</p>
<p>FIG. 1 shows a logic diagram of a wireless communication unit containing a speech coder adapted to support embodiments of the present invention;</p>
<p>FIG. 2 shows a logic diagram of a code excited linear predictive speech coder adapted to support embodiments of the present invention;</p>
<p>FIG. 3 illustrates a CELP speech encoder arrangement adapted to support embodiments of the present invention;</p>
<p>FIG. 4 illustrates a CELP speech decoder arrangement adapted in accordance with the preferred embodiment of the present invention;</p>
<p>FIG. 5 illustrates a view of the transpose of H' in accordance with embodiments of the present invention;</p>
<p>FIG. 6 illustrates a view of the transpose of GP' in accordance with embodiments of the present invention;</p>
<p>FIG. 7 illustrates a structure to simplify a H' matrix in accordance with embodiments of the present invention;</p>
<p>FIG. 8 illustrates a structure to simplify a GP' matrix in accordance with embodiments of the present invention; and</p>
<p>FIG. 9 illustrates an iterative method of sequential coding employed in embodiments of the present invention.</p>
<p>Description of Embodiments of the Present Invention</p>
<p>In one embodiment of the present invention a speech communication unit comprises a speech encoder capable of representing an input speech signal. The speech encoder comprises long-term prediction (LTP) logic comprising LTP memory; and quantization logic operably coupled to the long-term prediction (LTP) logic. The quantization logic is arranged to quantize a memory state of the LTP logic.</p>
<p>The provision of quantization logic to quantize a memory state of the LTP logic may provide a more robust embedded coding of speech for packet networks.</p>
<p>In this manner, the provision of quantization logic to quantize a memory state of the LTP logic ensures that only those features of the LTP memory state that are perceptually important are coded, rather than coding all of the features.</p>
<p>In one embodiment of the present invention, the quantization logic may be arranged to take into account at least one other decision made in the speech encoder.</p>
<p>In one embodiment of the present invention, the quantization logic may be arranged to quantize a memory state of the LTP logic with a plurality of pulses combined with one or more gains associated with the speech encoder.</p>
<p>In one embodiment of the present invention, the quantization logic may quantize the memory state of the LTP logic using an analysis-by-synthesis process. A synthesis part of the analysis-by-synthesis process may use at least one matrix to transform an LPC residual signal, which may be a frame-long LPC residual signal.</p>
<p>In one embodiment of the present invention, an LTP component of the LTP memory state quantization is optimised for a weighted speech signal over a speech frame of interest minus a contribution of one or more of the following over the same speech frame: (i) stochastic codebook components; (ii) long-term pitch predictor periods and pitch gain decisions; and (iii) pitch-filtered versions of the stochastic codebook components.</p>
<p>In one embodiment of the present invention, a perceptual weighting (H) matrix stores one impulse response per sub-frame that may be truncated. A plurality of peaks in each impulse response of a long-term pitch filtering of an excitation may be stored in compact form, and may be stored as a two-dimensional array. The two-dimensional array may comprise indices to a second array that comprises tap values, which may in turn comprise LTP gains and be analogous to an LTP filter. The two-dimensional array may be formed from a fractional pitch value and an LTP gain associated with each impulse sample in the speech frame of interest. A second array may comprise forward-looking pitch values for speech samples during a previous speech frame and a current speech frame.</p>
<p>In one embodiment of the present invention, the quantization logic may be operably coupled to a transmitter to transmit the quantized memory state of the LTP logic embedded within a primary bitstream.</p>
<p>In one embodiment of the present invention, the speech encoder may be a code excited linear predictive (CELP) encoder.</p>
<p>In one embodiment of the present invention a speech communication unit comprises a receiver for receiving a speech encoded bitstream from a speech encoder; a first speech decoder operably coupled to the receiver for receiving the speech encoded bitstream and having a conventional long term predictor (LTP) memory element that is driven by one or more previous codebook decisions made by the speech encoder. A second decoder is operably coupled to the receiver for receiving the speech encoded bitstream and has an LTP memory element that is updated by quantized values of a memory state of an LTP logic of the speech encoder. In one embodiment of the present invention, the first decoder and second decoder may be arranged to receive the same codebook and gain decisions provided by the speech encoder, which may be received at the same time.</p>
<p>In one embodiment of the present invention, the conventional LTP memory element of the first decoder may be arranged to receive the quantized values of a memory state of the LTP logic of the speech encoder in response to determining one or more errors in the speech encoded bitstream.</p>
<p>In one embodiment of the present invention, the data conveying the quantized values of a memory state of the LTP logic of the speech encoder may be embedded within the speech encoded bitstream. This has the advantage that if the communications channel conveying the speech bitstream has a very low bit rate, it would be possible to omit transmission of this additional data conveying the quantized values of a memory state of the LTP logic of the speech encoder.</p>
<p>In one embodiment of the present invention, an integrated circuit may comprise the aforementioned speech encoder and/or speech decoder.</p>
<p>In one embodiment of the present invention, a method of encoding in a speech communication unit comprises encoding a bitstream in a long term predictor (LTP) logic having an LTP memory; quantizing a memory state of the LTP logic; and transmitting an embedded signal comprising the bitstream and the quantized memory state of the LTP logic.</p>
<p>In one embodiment of the present invention, a method of decoding a speech encoded bitstream comprises receiving a speech encoded bitstream from a speech encoder in a first speech decoder and a second decoder; receiving at the second decoder quantized values of a memory state of an LTP logic of the speech encoder; and updating an LTP memory element in the second decoder in response thereto.</p>
<p>Embodiments of the present invention address the issue of coding the state of the LTP memory, to be conveyed as additional data of an embedded coder, in order to limit error propagation effects when channel errors are introduced. When combined with the excitation of the current speech frame, the quantization of the LTP memory state may also be employed to provide a second relatively independent coding of each speech frame, which may be used to improve the quality of the synthetic speech during error free periods using a multi-description coding method.</p>
<p>Turning now to FIG. 1, there is shown a logic diagram of a speech communication unit 100 adapted to support the inventive concept of embodiments of the present invention. The speech communication unit 100 contains an antenna 102 preferably coupled to a duplex filter or antenna switch 104 that provides isolation between a receiver chain and a transmitter chain within the speech communication unit 100.</p>
<p>As known in the art, the receiver chain typically includes a receiver front-end circuit 106 (effectively providing reception, filtering and intermediate or base-band frequency conversion). The front-end circuit 106 is serially coupled to a signal processing function 108. An output from the signal processing function is provided to a suitable output device 110, such as a speaker, via speech-processing logic 130.</p>
<p>The speech-processing logic 130 may comprise speech encoding logic 134 to encode a user's speech into a format suitable for transmitting over the transmission medium. The speech-processing logic 130 may also comprise speech decoding logic 132 to decode received speech into a format suitable for outputting via the output device (speaker) 110. The speech-processing logic is operably coupled to a memory element 116, via link 136, and a timer 118 via a controller 114.</p>
<p>In particular, the operation of the speech-processing logic 130 has been adapted to support the inventive concept of embodiments of the present invention. In particular, the speech-processing logic 130 has been adapted such that a speech encoder quantizes the long-term (or pitch) predictor (LTP) memory state so as to minimize the weighted error between the synthesised speech and the source speech. In one embodiment of the present invention, quantization logic uses a series of simplifications that make the aforementioned problem tractable within reasonable complexity and without adversely impacting quality. The adaptation of the speech-processing unit 130 is further described with respect to FIG's 2 to 4.</p>
<p>For completeness, the receiver chain also includes received signal strength indicator (RSSI) circuitry 112 (shown coupled to the receiver front-end 106, although the RSSI circuitry 112 could be located elsewhere within the receiver chain). The RSSI circuitry is coupled to a controller 114 for maintaining overall subscriber unit control. The controller 114 is also coupled to the receiver front-end circuitry 106 and the signal processor 108 (generally realised by a digital signal processor (DSP)). The controller 114 may therefore receive data with bit and/or frame errors and estimate the bit-error-rate (BER) or frame-error-rate (FER) from the recovered information. Such a mechanism may be used to identify whether errors have been introduced into the communication channel. The controller 114 is coupled to the memory device 116 for storing operating regimes, such as decoding/encoding functions and the like. A timer 118 is typically coupled to the controller 114 to control the timing of operations (transmission or reception of time-dependent signals) within the speech communication unit 100.</p>
<p>In the context of the present invention, the timer 118 dictates the timing of speech signals, in both the transmit (encoding) and receive (decoding) paths.</p>
<p>As regards the transmit chain, this essentially includes an input device 120, such as a microphone transducer, coupled in series via speech encoder 134 to a transmitter/modulation circuit 122. Thereafter, any transmit signal is passed through a power amplifier 124 to be radiated from the antenna 102. The transmitter/modulation circuitry 122 and the power amplifier 124 are operationally responsive to the controller, with an output from the power amplifier coupled to the duplex filter or antenna switch 104. The transmitter/modulation circuitry 122 and receiver front-end circuitry 106 comprise frequency up-conversion and frequency down-conversion functions (not shown).</p>
<p>Of course, the various components within the speech communication unit 100 can be arranged in any suitable functional topology able to utilise the inventive concepts of the present invention. Furthermore, the various components within the speech communication unit can be realised in discrete or integrated component form, with an ultimate structure therefore being merely an arbitrary selection.</p>
<p>It is within the contemplation of the invention that the preferred buffering or processing of speech signals can be implemented in software, firmware or hardware, with preferably a software processor (or indeed a digital signal processor (DSP)) performing the speech processing function.</p>
<p>Referring now to FIG. 2, an overview of the speech processing logic 130 is illustrated, highlighting both the encoder and decoder aspects. On the encoder side, a speech signal 205 is input to a code excited linear predictive (CELP) speech encoder 210 from a microphone (not shown). The CELP speech encoder 210 comprises long term predictor memory 315, adapted as described later with respect to FIG. 3. An output from the CELP speech encoder 210 is sent over the main bit-stream, to be modulated, amplified, frequency-converted and transmitted from the wireless communication unit, as described with reference to FIG. 1.</p>
<p>The codebook and gain decisions made by the CELP speech encoder 210 are also input to a simplified CELP speech decoder logic 225, which comprises an LTP memory estimate 230 that is quantized (as described later) according to embodiments of the present invention. An output from the CELP speech decoder logic 225 is input to subtractor logic 235, which also receives the input speech signal 205, to produce an error signal. The error signal is input to squaring logic 240 to produce a mean square error (MSE) value, or a perceptually weighted MSE value.</p>
<p>The MSE value is applied to the LTP vector quantisation search logic 250, which systematically determines the optimum quantized value of the LTP memory estimate 230 for which the MSE, or perceptually weighted MSE, is minimized in an analysis-by-synthesis (AbS) 255 manner.</p>
<p>An output from the LTP vector quantisation search logic 250 is sent to a receiver as additional embedded data to assist in the speech recovery (decoding) process. The output from the LTP vector quantisation search logic 250, in common with the stochastic excitation of a CELP codec, typically comprises a series of pulse positions and signs accompanied by a gain or energy term.</p>
<p>At the speech decoder, a main bitstream is received following demodulation, amplification and frequency down-conversion in a receiver chain of a receiving wireless communication unit, as described with reference to FIG. 1. The main bit-stream is input to two parallel CELP decoders 260, 275, which comprise respective LTP memory elements 265, 280. The CELP decoder 260 has a conventional LTP memory element, 265, which is driven by previous codebook decisions, whilst the second CELP decoder 275 has an unconventional LTP memory element, 280, which is updated by the quantized values of the additional embedded data 285, made by vector quantisation search logic 250.</p>
<p>It is noteworthy that the same codebook and gain decisions made in the conventional CELP encoder 210 are applied to both of the decoders. The outputs from the two parallel CELP decoders 260, 275 are input to weighted summation logic 270, which provides a high quality output when the communication channel is error free (in typical operating conditions). Following an error, where the conventional LTP memory 265 of CELP decoder 260 would be corrupted, a bootstrap 290 of LTP memory 265 is performed, using the additional enhancement data 285, which means that the outputs of the two CELP decoders 260 and 275 will be identical for that frame. The outputs from the two CELP decoders 260 and 275 are input to weighted sum logic 270, which is arranged to apply a suitable weight to the two outputs, and outputs a high quality signal 295.</p>
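<p>As a rough illustration of the decoder arrangement just described, the following Python sketch mirrors the weighted-sum and bootstrap behaviour. The decoder objects, their ltp_memory buffers, the synthesize() method and the fixed mixing weight are illustrative assumptions, not details taken from the patent.</p>

```python
class DualDecoder:
    """Sketch of the FIG. 2 decoder arrangement: two CELP decoders share
    the main bitstream; the second resets its LTP memory every frame from
    the embedded quantized state, and the two outputs are mixed."""

    def __init__(self, dec_a, dec_b, weight=0.9):
        self.dec_a = dec_a    # conventional LTP memory (cf. element 265)
        self.dec_b = dec_b    # LTP memory replaced per frame (cf. element 280)
        self.weight = weight  # hypothetical mixing weight for logic 270

    def decode_frame(self, frame_bits, quantized_ltp_state, frame_erased):
        # Decoder B always starts the frame from the transmitted LTP state.
        self.dec_b.ltp_memory[:] = quantized_ltp_state
        if frame_erased:
            # Bootstrap (290): repair decoder A's corrupted LTP memory with
            # the embedded data, so both decoders produce identical output
            # for this frame and re-converge thereafter.
            self.dec_a.ltp_memory[:] = quantized_ltp_state
        out_a = self.dec_a.synthesize(frame_bits)
        out_b = self.dec_b.synthesize(frame_bits)
        # Weighted sum (270): a small multiple-description gain when the
        # channel is error free.
        return self.weight * out_a + (1.0 - self.weight) * out_b
```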
<p>Referring now to FIG. 3, a logic diagram of a code excited linear predictive (CELP) speech encoder 210 is shown, according to the preferred embodiment of the present invention. An acoustic input signal to be analysed is applied to speech coder 210 at microphone 302. The input signal is then applied to filter 304 to remove high frequency components which would otherwise cause aliasing during sampling.</p>
<p>The analogue speech signal from filter 304 is then converted into a sequence of N pulse samples, and the amplitude of each pulse sample is then represented by a digital code in analogue-to-digital (A/D) converter 308, as known in the art. The sampling rate is determined by the sample clock (SC), which is generated, along with the frame clock (FC), by a clock function.</p>
<p>The digital output of A/D 308, which may be represented as input speech vector s(n), is then applied to coefficient analyser 310. This input speech vector s(n) is repetitively obtained in separate frames, i.e. blocks of time, the length of which is determined by the frame clock (FC), as is known in the art.</p>
<p>For each block of speech, a set of linear predictive coding (LPC) parameters is produced in accordance with a preferred embodiment of the invention by coefficient analyser 310. The generated speech coder parameters may include the following: LPC parameters, long-term predictor (LTP) parameters, and excitation gain factor (G2) (along with the best stochastic codebook excitation codeword I). Such speech coding parameters are applied to multiplexer 350 and sent over the channel 352 for use by the speech synthesizer at the decoder. The input speech vector s(n) is also applied to subtractor logic 330, the function of which is described later.</p>
<p>Within the conventional CELP encoder of FIG. 3, the codebook search controller 340 selects the best indices and gains from the adaptive codebook within LTP logic (adaptive codebook) 316 and the stochastic codebook 314, in order to produce a minimum weighted error in the summed chosen excitation vector used to represent the input speech sample. The outputs of the stochastic codebook 314 and the adaptive codebook 316 are input into respective gain functions 322 and 318. The gain-adjusted outputs are then summed in summer 320 and input into the LPC filter 324, as is known in the art.</p>
<p>In operation, first the adaptive codebook or long-term predictor component, l(n), is computed. This is characterised by a delay and a gain factor G1. For each individual stochastic codebook excitation vector u_i(n), a reconstructed speech vector s'_i(n) is generated for comparison to the input speech vector s(n). Gain logic 322 scales the excitation gain factor G2 and summing logic 320 adds in the adaptive codebook component. Such gain may be pre-computed by coefficient analyser 310 and used to analyse all excitation vectors, or may be optimised jointly with the search for the best excitation codeword I, generated by codebook search controller 340.</p>
<p>The scaled excitation signal G1·l(n) + G2·u_i(n) is then filtered by the linear predictive coding filter 324, which constitutes a short-term predictor (STP) filter, to generate the reconstructed speech vector s'_i(n). The reconstructed speech vector s'_i(n) for the i-th excitation code vector is compared to the same block of input speech vector s(n) by subtracting these two signals in subtractor 330.</p>
<p>The difference vector e_i(n) represents the difference between the original and the reconstructed blocks of speech. The difference vector is perceptually weighted by weighting filter 332, utilising the weighting filter parameters (WTP) generated by coefficient analyser 310.</p>
<p>Perceptual weighting accentuates those frequencies where the error is perceptually more important to the human ear, and attenuates other frequencies.</p>
<p>An energy calculator function inside the codebook search controller 340 computes the energy of the weighted difference vector e'_i(n). The codebook search controller compares the i-th error signal for the present excitation vector u_i(n) against previous error signals, to determine the excitation vector producing the minimum error. The code of the i-th excitation vector having a minimum error is then output over the channel as the best excitation code I. A copy of the scaled excitation G1·l(n) + G2·u_i(n) is stored within the long term predictor memory 315 for future use.</p>
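<p>For readers unfamiliar with the loop just described, a minimal Python sketch of the stochastic codebook search follows. The function name, the callable filters and the per-codeword joint gain optimisation are illustrative assumptions; a practical encoder would use efficient filtered-codebook recursions rather than filtering every codeword from scratch.</p>

```python
import numpy as np

def search_stochastic_codebook(codebook, l_adaptive, g1, s_in,
                               lpc_filter, w_filter):
    """Analysis-by-synthesis search in the spirit of FIG. 3: choose the
    stochastic codeword (and gain G2) whose weighted synthesis best
    matches the input. lpc_filter and w_filter are callables standing in
    for LPC filter 324 and weighting filter 332; all vectors are numpy
    arrays of one sub-frame."""
    # Weighted target: input minus the adaptive-codebook contribution
    # (for linear filters, weighting the difference or weighting the two
    # terms separately is equivalent).
    t = w_filter(s_in) - w_filter(lpc_filter(g1 * l_adaptive))
    best_i, best_g2, best_err = -1, 0.0, float("inf")
    for i, u in enumerate(codebook):
        y = w_filter(lpc_filter(u))                   # weighted codeword
        g2 = np.dot(t, y) / max(np.dot(y, y), 1e-12)  # least-squares G2
        e = t - g2 * y                                # weighted error e'_i(n)
        err = np.dot(e, e)                            # energy (controller 340)
        if err < best_err:
            best_i, best_g2, best_err = i, float(g2), float(err)
    return best_i, best_g2, best_err
```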
<p>In an alternative embodiment, codebook search controller 340 may determine a particular codeword that provides an error signal having some predetermined criteria, such as meeting a predefined error threshold.</p>
<p>A more detailed description of the functionality of a typical speech encoding unit can be found in "Digital speech coding for low-bit rate communications systems" by A. M. Kondoz, published by John Wiley in 1994.</p>
<p>In accordance with embodiments of the present invention, the speech encoding and/or speech decoding logic has been adapted to quantize the long-term (or pitch) predictor (LTP) memory state, so as to minimise the weighted error between the synthesised speech and the source speech. In embodiments of the present invention, the quantization procedure uses a series of simplifications, in logic elements 225, 230, 235, 240 and 250 of FIG. 2, that make the aforementioned problem tractable within reasonable complexity without adversely impacting quality.</p>
<p>The adapted speech encoding logic and speech decoding logic (and associated methods of operation) assume that the same LTP updates that have been computed for the core codec are once again applied to this LTP state 280 in order to arrive at the final synthetic speech 295.</p>
<p>Thus, in this manner, the LTP state may be quantized in a very compact fashion, with only a few pulses. This version of the LTP state may then be employed in the decoder to function in two key ways. Firstly, it can be employed to provide an error resilience improvement, bootstrapping 290 the LTP state after frame loss. It may also be used in an error-free environment to provide a second description of the source waveform, to provide a small multiple-description coding (MDC) gain in the output 295.</p>
<p>The decoder functionality is substantially the reverse of that of the encoder, as illustrated in FIG. 4. Within a conventional CELP decoder, the best indices and gains sent in the main bitstream (say, from the adaptive codebook within logic 316 of FIG. 3) are used with the stochastic codebook 414, in order to produce a minimum weighted error in the summed chosen excitation vector used to replicate the source speech sample. Thus, the output of the stochastic codebook 414 is input into gain function 422 and the LTP output is input to respective gain function 418. The gain-adjusted outputs are then summed in summer 420 and input into the LPC filter 424 and the LPC and pitch post-filter 425, as is known in the art.</p>
<p>Firstly the adaptive codebook or long-term predictor component, l(n), is computed. This is characterised by a delay and a gain factor G1.</p>
<p>For each individual stochastic codebook excitation vector u_i(n), a reconstructed speech vector s'_i(n) is generated for comparison to the input speech vector s(n). Gain logic 422 scales the excitation gain factor G2 and summing logic 420 adds in the adaptive codebook component. The scaled excitation signal G1·l(n) + G2·u_i(n) is then filtered by the linear predictive coding filter 424, which constitutes a short-term predictor (STP) filter, to generate the reconstructed speech vector s'_i(n).</p>
<p>Whilst embodiments of the present invention are described with reference to a CELP coder, it is envisaged by the inventors that any other speech-processing logic using pitch prediction, where transmission errors may occur, may benefit from the inventive concept contained herein.</p>
<p>The inventive concept described herein finds particular use in speech processing units for wireless communication units, such as universal mobile telecommunication system (UMTS) units, global system for mobile communications (GSM) units, TErrestrial Trunked RAdio (TETRA) communication units, Voice over Internet Protocol (VoIP) units, etc. In one embodiment of the present invention, a method of coding the LTP state is derived in order to minimize the weighted error over the entire frame of speech that uses that LTP state, represented in FIG. 2 by logic elements 225, 230, 235, 240 and 250. In order to obtain an optimum LTP component, the perceptually weighted error, $E$, is minimized, as follows:</p>
<p>$$E = |s - s'|^2 = \lambda^2 \mathrm{LTP}_{mem}^T (GP)^T H^T H (GP) \mathrm{LTP}_{mem} - 2\lambda \mathrm{LTP}_{mem}^T (GP)^T H^T (s - s_0) + |s - s_0|^2 \qquad [1]$$</p>
<p>where: $\lambda$ is the LTP memory state gain; $\mathrm{LTP}_{mem}$ is the quantized LTP memory state; $s_0$ is the zero LTP memory response (with fixed codebook excitation) of the linear prediction (LP) filter; $s$ is the target speech signal; $H$ is the LP filtering and perceptual weighting matrix; and $GP$ is the matrix that transforms the LTP memory into a frame-long residual.</p>
<p>The target signal in this case is the weighted speech signal less the contribution due to the stochastic codebook components, along with their pitch filtered components, during the frame of interest.</p>
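<p>Read literally, equation [1] translates into the following numpy sketch. It assumes dense H and GP matrices are available; the simplifications introduced below exist precisely to avoid forming and multiplying such large matrices.</p>

```python
import numpy as np

def weighted_error(lam, ltp_mem, H, GP, s, s0):
    """Perceptually weighted error E of equation [1] for a candidate
    quantized LTP memory state ltp_mem and gain lam (lambda)."""
    y = H @ (GP @ ltp_mem)   # weighted synthesis of the LTP contribution
    r = s - s0               # target less the zero-LTP-memory response
    # Quadratic-form expansion of |s - s'|^2, term by term as in [1].
    return (lam * lam * float(y @ y)
            - 2.0 * lam * float(y @ r)
            + float(r @ r))
```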
<p>Embodiments of the present invention propose further improvements by refining the two main matrices, H' and GP'. H' is a lower-triangular matrix of the form:</p>
<p>$$H = \begin{bmatrix} h_{00} & 0 & \cdots & 0 \\ h_{10} & h_{11} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h_{FL,0} & h_{FL,1} & \cdots & h_{FL,FL} \end{bmatrix}$$</p>
<p>where $h_{nm}$ is the synthesized value at the $m$th sample when the LTP component is an impulse at the $n$th position.</p>
<p>For a typical wideband speech codec (for example in the case of the ITU-T Recommendation G.722.2 codec) this matrix has 64k entries (i.e. 256 x 256). An alternative view of H' (this time the transpose of H') is provided in FIG. 5. Here, the transpose of H' 505 indicates a series of four impulse responses 515, 520, 525, 530 over a frame length 510.</p>
<p>The problem with the transpose of the H' matrix in FIG. 5 is that, unlike the analogous H' matrix in conventional sub-frame processing for CELP coders, the form of each column is not necessarily the same as that of the sample before. This is due to the fact that the LPC filter changes at every sub-frame boundary. In fact every column may be different, except for those entries of the last sub-frame that follow the conventional form, with each sample being the same as the sample before but with a single sample delay.</p>
<p>The GP' matrix is of the form:</p>
<p>$$GP = \begin{bmatrix} g_{00} & g_{01} & \cdots \\ g_{10} & g_{11} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}, \quad FL \times \mathrm{MaxPitch}$$</p>
<p>where $g_{nm}$ represents the $n$th sample of the synthetic speech frame when the LTP memory contains an impulse at the $m$th position, with its magnitude determined by the LTP gain applied in the frame.</p>
<p>Again, an alternative view of GP', i.e. the transpose of GP' 600, is illustrated in FIG. 6. Here, the transpose of GP' 600 indicates a series of four impulse responses 615, 620, 625, 630 over a frame length 610, whereby the pitch components are more easily seen, as are the fractional pitch filter response components.</p>
<p>Again, for a typical wideband speech codec (ITU-T G.722.2) this matrix has 57.5k entries (231 x 256). This matrix captures both the LTP gain and pitch components over the frame, including the sub-sample pitch filtering.</p>
<p>Advantageously, the matrix can be easily calculated using standard codec building blocks, throughout the frame. However, such calculations result in a significant increase in complexity, as the LTP filtering must be repeated for each sample of the pitch period (231 positions worst case) and for all sub-frames of the speech frame.</p>
<p>The two matrices GP' and H' may then be combined, once they are known, into a single matrix of 57.5k elements H.GP'.</p>
<p>For a given LTP state, the optimum $\lambda$ can be obtained by equating the first derivative of the error with respect to $\lambda$ to zero:</p>
<p>$$2\lambda \mathrm{LTP}_{mem}^T (GP)^T H^T H (GP) \mathrm{LTP}_{mem} - 2 \mathrm{LTP}_{mem}^T (GP)^T H^T (s - s_0) = 0$$</p>
<p>$$\lambda = \frac{\mathrm{LTP}_{mem}^T (GP)^T H^T (s - s_0)}{\mathrm{LTP}_{mem}^T (GP)^T H^T H (GP) \mathrm{LTP}_{mem}} \qquad [2]$$</p>
<p>Substituting this into equation [1] yields:</p>
<p>$$E = |s - s_0|^2 - \frac{\left( \mathrm{LTP}_{mem}^T (GP)^T H^T (s - s_0) \right)^2}{\mathrm{LTP}_{mem}^T (GP)^T H^T H (GP) \mathrm{LTP}_{mem}} \qquad [3]$$</p>
<p>Using the error term of equation [3], and the gain term of equation [2], it is possible to construct quantization search algorithms that will return the best possible quantized LTP state.</p>
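<p>Equations [2] and [3] translate almost line for line into code. The sketch below again assumes dense matrices; the small constant guarding the denominator is an implementation assumption.</p>

```python
import numpy as np

def optimal_gain_and_error(ltp_mem, H, GP, s, s0):
    """Closed-form optimum of equations [2] and [3]: the gain lambda
    minimizing E for a given quantized LTP state, and the residual error."""
    y = H @ (GP @ ltp_mem)            # H (GP) LTPmem
    r = s - s0
    num = float(y @ r)                # LTPmem^T (GP)^T H^T (s - s0)
    den = max(float(y @ y), 1e-12)    # LTPmem^T (GP)^T H^T H (GP) LTPmem
    lam = num / den                   # equation [2]
    err = float(r @ r) - num * num / den   # equation [3]
    return lam, err
```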
<p>In one embodiment of the present invention, a simplification to the H' matrix may be made, in order to reduce both the memory requirements and the processing requirements. In this embodiment, only four impulse responses 710 are stored, as shown in FIG. 7. One impulse response is stored for each sub-frame, with a changeover point defined three quarters through each of the first three sub-frames. In addition, the impulse responses are truncated significantly to less than a sub-frame (24 samples) length 705, thereby reducing the storage of the H' matrix from 64k words to only 96 words. These shorter impulse responses also reduce the complexity required to calculate the H.GP' matrix product.</p>
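<p>A sketch of this storage simplification is given below. The 64-sample sub-frame length is an assumption consistent with the 256-sample frame quoted above, and the changeover rule follows the three-quarter point described in the text.</p>

```python
import numpy as np

def truncate_responses(h_per_subframe, trunc_len=24):
    """Keep one weighted-synthesis impulse response per sub-frame,
    truncated to trunc_len samples: 4 x 24 = 96 words instead of the
    full 256 x 256 lower-triangular H' matrix."""
    return np.array([np.asarray(h)[:trunc_len] for h in h_per_subframe])

def response_for_position(h_trunc, n, subframe_len=64):
    """Select the stored response for a pulse at frame position n, with
    the changeover three quarters of the way through each of the first
    three sub-frames."""
    sf = (n + subframe_len // 4) // subframe_len
    return h_trunc[min(sf, len(h_trunc) - 1)]
```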
<p>However, the derivation of the GP' matrix is responsible for much of the complexity in computing equation [3]. Hence, in one embodiment of the present invention, a further simplification is applied to the GP' matrix.</p>
<p>This further simplification emanated from the inventors of the present invention recognising and appreciating that there will be one peak for each sample in the speech frame, traceable back to one of the pitch pulses in the LTP memory. Therefore, the structures illustrated in FIG. 8 are able to characterise the entries in the GP' matrix and store the peaks in the response, with each peak being stored as three non-negative values, two samples each side of the peak. This is analogous to a series of three-tap long-term predictor filters.</p>
<p>The first 2-D array 805 contains a series of indices into the second array 810, where the tap values, including the LTP gains, are stored. Notably, these structures require a total of 3.5k entries (i.e. 231 x (256/34 + 4) + 256 x 3), rather than the original 57.5k.</p>
<p>The arrays 805, 810 are formed by first creating a fractional pitch value and an LTP gain associated with each sample in the frame by analysis of the codec parameters. This initial step is relatively easy to do for all CELP codecs. Conventional CELP codecs maintain the same pitch value over a complete sub-frame; whereas RCELP codecs interpolate the pitch from one sample to the next over a frame. Hence, RCELP codecs, such as EVRC, can be accommodated, albeit at slightly higher complexity.</p>
<p>It is then useful to back-track through this fractional pitch array 805, in order to compute a second array 810 containing the forward-looking pitch value (or at least values up to a maximum of 2'), which are effective at each sample throughout the last pitch period of the LTP memory and throughout the current frame, except for the final pitch period.</p>
<p>In one embodiment of the present invention, multiple forward-looking pitch possibilities may be considered at this stage, to address the problem in conventional CELP codecs where a pitch period may change abruptly and generate overlaps at the sub-frame boundaries, or gaps in the track. In this embodiment, a simple recursive routine may be used to explore the forward-looking pitch value array, taking into account the potential branches in the array. For each sample of the last pitch period of the LTP memory, the simple recursive routine determines a position (the first of the final 2-D arrays), and gains of the peaks, by combining the fractional sample placements and the three relevant positive low-pass filter samples (the second of the final 2-D arrays).</p>
<p>The formation of these arrays, using the method described above, may be performed very efficiently in 1.43 weighted MOPs, worst case, instead of the exhaustive method for GP' of 150 weighted MOPs, worst case.</p>
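<p>Reduced to a toy integer-pitch version, the first steps of this construction might look as follows. The function names and the integer rounding are illustrative assumptions; the patent additionally handles fractional pitch, branching pitch tracks and the three-tap peak gains.</p>

```python
import numpy as np

def per_sample_pitch_tracks(subframe_pitch, subframe_gain, subframe_len=64):
    """Expand per-sub-frame pitch and LTP gain (constant per sub-frame in
    a conventional CELP codec) into per-sample arrays; an RCELP codec
    would interpolate sample by sample instead."""
    pitch = np.repeat(np.asarray(subframe_pitch, float), subframe_len)
    gain = np.repeat(np.asarray(subframe_gain, float), subframe_len)
    return pitch, gain

def forward_pitch_links(pitch, mem_len):
    """Back-track through the per-sample pitch array: for each LTP memory
    sample (negative index), list the frame samples whose pitch chain
    traces back to it."""
    links = {m: [] for m in range(-mem_len, 0)}
    for n in range(len(pitch)):
        src = n
        while src >= 0:                         # follow the chain into memory
            src -= max(1, int(round(pitch[src])))
        if -mem_len <= src < 0:
            links[src].append(n)                # memory sample src feeds n
    return links
```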
<p>Unfortunately, the H.GP' matrix product cannot be stored as efficiently as either GP' or H' individually, since it turns out to have a pitch length x frame length region that is close to full-rank. Hence, in one embodiment of the present invention it is envisaged that equations [2] and [3] are computed without the 57.5k storage requirement. In this embodiment, the entries of the H.GP' matrix product may be calculated on-the-fly during the LTP coding search. This calculation, based upon the previously described structures, may be easily achieved by making use of the pulse locations and gains for a given LTP memory location, and scaling the appropriate shortened impulse responses from the H' entries before accumulation.</p>
<p>A conventional sequential coding method may be used, as described in Hanzo et al, "Voice Compression & Communications", Wiley, 2001, where pulses are placed one at a time by maximising the second term of equation [3] at each stage. The effect of each pulse is then removed using an optimal gain.</p>
<p>Finally, a realistic implementation of a sequential search for quantizing the LTP memory state may be performed, where the basic iterative process is illustrated in the sequential coding flowchart 900 of FIG. 9. A first pulse is placed by finding a maximum magnitude entry in:</p>
<p>$$(GP)^T H^T (s - s_0) \qquad [4]$$</p>
<p>where the pulse sign is that of the entry, as shown in step 905. The optimal value for $\lambda$ is then computed from equation [2], for excitation comprising all pulses placed so far. A revised weighted computation is then performed in step 910, thereby forming:</p>
<p>$$(s - s_0) - \lambda H (GP)\,\hat{e} \qquad [5]$$</p>
<p>where $\hat{e}$ comprises the pulses placed so far. The next pulse is then placed by finding a maximum magnitude entry in:</p>
<p>$$(GP)^T H^T \left( (s - s_0) - \lambda H (GP)\,\hat{e} \right) \qquad [6]$$</p>
<p>where the pulse sign is that of the entry, as shown in step 915. A determination is then made as to whether all pulses have been placed, in step 920. If all pulses have not yet been placed, the method loops back to step 910 for the next pulse to be placed.</p>
<p>If all pulses have been placed, in step 920, the optimal value for $\lambda$ is then computed again from equation [2], for excitation comprising all pulses. Thus, such a sequential coding approach leads to a solution, as in step 925, where:</p>
<p>$$\mathrm{LTP}_{mem} = \lambda\,\hat{e} \qquad [7]$$</p>
<p>The sequential coding method is then complete, as shown in step 930. Observing equations [2] and [3], the denominator term:</p>
<p>$$\frac{1}{\mathrm{LTP}_{mem}^T (GP)^T H^T H (GP) \mathrm{LTP}_{mem}} \qquad [8]$$</p>
<p>is assumed in the above case to be more or less constant, as we maximise the numerator term of equation [3] to place the pulses. Whilst this assumption is often made during conventional sub-frame CELP processing, it turns out to be better to take this into account when we quantize the LTP memory. The inventors of the present invention identified that by pre-multiplying each of the entries during the maximum magnitude search by the appropriate entry in an array containing the reciprocals of the square roots of the entries in the vector $(GP)^T H^T H (GP)$, the quality of the search was improved significantly.</p>
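<p>Putting the flowchart together, a dense-matrix sketch of the sequential search (steps 905 to 925) is shown below. The patent instead evaluates the required H.GP' entries on the fly from the compact structures, and improves the search by the reciprocal-square-root normalisation discussed above; both refinements are omitted here for brevity.</p>

```python
import numpy as np

def sequential_ltp_quantization(H, GP, s, s0, n_pulses=6):
    """Sequential pulse placement of FIG. 9, equations [4] to [7]."""
    A = H @ GP                  # column n: weighted response to a pulse at n
    r = s - s0
    e = np.zeros(A.shape[1])    # pulse excitation being built

    target = r                  # first pulse correlates against (s - s0), [4]
    for _ in range(n_pulses):
        corr = A.T @ target               # steps 905 / 915
        n = int(np.argmax(np.abs(corr)))
        e[n] += np.sign(corr[n])          # pulse sign follows the entry
        y = A @ e
        lam = float(y @ r) / max(float(y @ y), 1e-12)  # equation [2]
        target = r - lam * y              # revised weighted target, [5]/[6]
    return lam * e                        # LTPmem = lambda * e, equation [7]
```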
<p>Although, ideally, the denominator term should be modified to more precisely follow that of equation [3] as the pulses are placed, it was found that much of the benefit of this term was preserved with the fixed array, with clear complexity benefits.</p>
<p>The calculation of such a fixed array may also be simplified significantly by using a product of the squares of the GP' peaks and the total energy of the four shortened H' responses to derive the fixed array entry values. A $1/\sqrt{x}$ function is required for each sample of the pitch period, which is a complex procedure using fixed-point DSPs, although a numerical approximation using a Taylor or MacLaurin series may often suffice for such functions.</p>
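<p>One concrete way to realise such an approximation is sketched below, using Newton refinement seeded from the binary exponent; this is a common fixed-point DSP alternative to a truncated Taylor or MacLaurin series, and the seed and iteration count are implementation assumptions.</p>

```python
import math

def inv_sqrt(x, iters=4):
    """Approximate 1/sqrt(x) without a hardware square root.
    Newton step for f(y) = 1/y**2 - x:  y <- y * (1.5 - 0.5 * x * y * y)."""
    assert x > 0.0
    # Seed from the binary exponent: x = m * 2**e with m in [0.5, 1),
    # so 2**-(e // 2) is within roughly a factor of two of the answer,
    # which guarantees convergence of the iteration.
    e = math.frexp(x)[1]
    y = math.ldexp(1.0, -(e // 2))
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```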
<p>In one embodiment of the present invention, the results of this fixed array may be stored for the duration of the pulse search, say, in a 231-word array. This involves quantizing the long-term (or pitch) predictor (LTP) memory state so as to minimise the weighted error between the synthesised speech and the source speech using a series of simplifications, which make the problem tractable within reasonable complexity without adversely impacting quality.</p>
<p>Table 1 provides an indication of the complexity and quality improvement when comparing the aforementioned implementation with a known sequential search over the entire frame with little or no simplification. The comparison is made using the known third generation partnership project 2 (3GPP2) wideband speech codec, known as VMR. Six pulses have been used to encode the LTP memory state, since this compact quantization may be transported along with other parameters within one of the 4 kb/s layers above the core data stream for the ITU-T embedded wideband speech codec, as described above. In order to quantify the benefit, values for a direct pulse code modulated (PCM) quantization of the LTP memory with 6 pulses are also given. As shown, the best case MDC gain quality is computed by calculating an optimal gain mixing parameter, such that the final speech is a weighted combination of the VMR synthesised speech and the speech produced by bootstrapping the LTP memory with the quantized version calculated according to the methods described in the above embodiments.</p>
<p>Notably, the complexity reduction in implementing the above concept is over 98% (e.g. an improvement to 54 WMOPs from 3354 WMOPs). The reduction in memory requirements is similarly 98%, whilst the quality penalty has been determined as being negligible when using the ITU-T Recommendation P.862 objective speech quality measure, known as Wideband PESQ.</p>
<p>Table 1:</p>
LTP Quantization Algorithm Configuration | Worst Case Frame Encoder Complexity (WMOPs) | ITU-T P.862 WB PESQ MOS (100% LTP Memory Replacement at Decoder) | ITU-T P.862 WB PESQ MOS (Best Case MDC Quality)
--- | --- | --- | ---
Standard VMR Encode | 27.5 | - | 3.920
6 Pulses, Direct Quantization from LTP Memory | 27.8 | 3.120 | 3.921
6 Pulses, Sequential Search over Frame With No Simplification (Requires 179 kwords) | 3382.4 | 3.386 | 3.930
6 Pulses, Sequential Search over Frame With Full Simplifications (Requires 3.6 kwords) | 81.8 | 3.380 | 3.933

<p>It will be understood that the improved speech communication unit, associated integrated circuit providing speech encoding and/or decoding functions, and method of operation therefor, as described above, aim to provide at least one or more of the following advantages:</p>
<p>(i) Future cellular and Voice over the Internet Protocol (VoIP) standards may benefit from the embodiments described above, although the inventive concept is particularly applicable for the ITU-T SG-16's standard for the robust embedded coding of speech for packet networks.</p>
<p>(ii) The inventive concept provides a solution to the problem of quantizing the LTP memory, with quality similar to the well-known full codebook search method.</p>
<p>(iii) The inventive concept provides a solution with significantly lower complexity and memory requirements.</p>
<p>(iv) The inventive concept provides a solution that supports an implementation of a speech encoder-decoder on a single CPU DSP. At the time of writing, a single CPU DSP can be expected to provide a number of MIPS that is almost twice the encoder complexity given in Table 1.</p>
<p>In particular, it is envisaged that the aforementioned inventive concept can be applied by a semiconductor manufacturer to any CELP-based speech encoding/decoding integrated circuit. It is further envisaged that, for example, a semiconductor manufacturer may employ the inventive concept in a design of a stand-alone device, or application-specific integrated circuit (ASIC) and/or any other sub-system element to support speech processing.</p>
<p>It will be appreciated that any suitable distribution of functionality between different functional devices or elements or logic may be used, without detracting from the inventive concept herein described. Hence, references to specific functional devices or elements or logic are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.</p>
<p>Aspects of the invention may be implemented in any suitable form including hardware, software, firmware or any combination of these. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit or IC, in a plurality of units or ICs or as part of other functional units.</p>
<p>Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein.</p>
<p>Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term 'comprising' does not exclude the presence of other elements or steps.</p>
<p>Furthermore, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather indicates that the feature is equally applicable to other claim categories, as appropriate.</p>
<p>Furthermore, the order of features in the claims does not imply any specific order in which the features must be performed and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus, references to "a", "an", "first", "second" etc. do not preclude a plurality.</p>
<p>Thus, an improved speech communication unit and an integrated circuit and method of operation therefor have been described, wherein the aforementioned disadvantages with prior art arrangements have been substantially alleviated.</p>

Claims (27)

  1. <p>A speech communication unit (100) comprising a speech
     encoder (210) capable of representing an input speech signal, the speech encoder (210) comprising: long-term prediction (LTP) logic comprising LTP memory (315); and quantization logic operably coupled to the speech encoder (210), wherein the quantization logic is arranged to quantize a memory state of the LTP logic.</p>
    <p>2. The speech communication unit (100) of Claim 1 wherein the quantization logic is arranged to take into account at least one other decision made in the speech encoder (210), for example a decision of at least one subframe regarding a selection of one or more of: a stochastic codebook or a gain, or a pitch period or a pitch gain.</p>
    <p>3. The speech communication unit (100) of Claim 1 or Claim 2 wherein the quantization logic is arranged to quantize a memory state of the LTP logic with a plurality of pulses combined with one or more gains associated with the speech encoder (210).</p>
    <p>4. The speech communication unit (100) of any preceding Claim wherein the quantization logic quantizes the memory state of the LTP logic using an analysis-by-synthesis process.</p>
    <p>5. The speech communication unit (100) of Claim 4 wherein a synthesis part of the analysis-by-synthesis process uses at least one matrix to transform an estimate of the LTP memory state into a perceptually weighted residual signal.</p>
    <p>6. The speech communication unit (100) of Claim 5 wherein the analysis-by-synthesis process uses at least one matrix to transform an estimate of the LTP memory state into a frame-long perceptually weighted residual signal.</p>
    <p>7. The speech communication unit (100) of any of preceding Claims 4 to 6 wherein an LTP component of the LTP memory state quantization is optimised for a weighted speech signal over a speech frame of interest minus a contribution of one or more of the following over the same speech frame: (i) stochastic codebook components; (ii) long-term pitch predictor periods and pitch gain decisions; (iii) pitch-filtered versions of the stochastic codebook components.</p>
    <p>8. The speech communication unit (100) of any of preceding Claims 5 to 7 wherein a perceptual weighting (H) matrix is stored as a sequence of impulse responses, with one per sub-frame.</p>
    <p>9. The speech communication unit (100) of Claim 8 wherein the impulse responses per sub-frame are truncated to a period less than a subframe length.</p>
    <p>10. The speech communication unit (100) of Claim 8 or Claim 9 wherein a plurality of peaks in each impulse response of a long-term pitch filtering of an excitation are stored in compact form.</p>
    <p>11. The speech communication unit (100) of Claim 10 wherein the plurality of peaks in each impulse response of the long-term pitch filtering of the excitation are stored as a two-dimensional array.</p>
    <p>12. The speech communication unit (100) of Claim 11 wherein the two dimensional array comprises indices to a second array that comprises tap values analogous to an LTP filter.</p>
    <p>13. The speech communication unit (100) of Claim 12 wherein the tap values of the second array comprise LTP gains.</p>
    <p>14. The speech communication unit (100) of any of preceding Claims 11 to 13 wherein a first two-dimensional array is formed from a fractional pitch value and an LTP gain associated with each impulse sample in the speech frame of interest.</p>
    <p>15. The speech communication unit (100) of any of preceding Claims 12 to 14, wherein the second array comprises forward-looking pitch values for speech samples during a previous speech frame and a current speech frame.</p>
    <p>16. The speech communication unit (100) of any of the preceding Claims wherein the quantization logic is operably coupled to a transmitter to transmit the quantized memory state of the LTP logic embedded within a primary bitstream.</p>
    <p>17. The speech communication unit (100) of any of the preceding Claims wherein the speech encoder is a code excited linear predictive encoder.</p>
    <p>18. A speech communication unit (100) comprising: a receiver for receiving a speech encoded bitstream from a speech encoder; a first speech decoder (260) operably coupled to the receiver for receiving the speech encoded bitstream and having a conventional long term predictor (LTP) memory element (265) that is driven by one or more previous codebook decisions made by the speech encoder; and a second decoder (275) operably coupled to the receiver for receiving the speech encoded bitstream and having an LTP memory element (280) that is updated at a start of each frame with quantized values of a memory state of an LTP logic of the speech encoder.</p>
    <p>19. The speech communication unit (100) of Claim 18 wherein the first decoder (260) and second decoder (275) are arranged to receive the same codebook and gain decisions provided by the speech encoder.</p>
    <p>20. The speech communication unit (100) of Claim 19 wherein the first decoder (260) and second decoder (275) are arranged to receive the same codebook and gain decisions provided by the speech encoder at the same time.</p>
    <p>21. The speech communication unit (100) of any of preceding Claims 18 to 20 wherein the conventional LTP memory element (265) of the first decoder (260) is arranged to receive the quantized values of a memory state of the LTP logic of the speech encoder in response to determining one or more errors in the speech encoded bitstream.</p>
    <p>22. The speech communication unit (100) of any of preceding Claims 18 to 21 wherein the quantized values of a memory state of the LTP logic of the speech encoder are embedded within the speech encoded bitstream.</p>
    <p>23. An integrated circuit comprising the speech encoder or speech decoder of any of the preceding Claims.</p>
    <p>24. A method of encoding in a speech communication unit comprising: encoding a bitstream in a long term predictor (LTP) logic having an LTP memory; quantizing a memory state of the LTP logic; and transmitting an embedded signal comprising the bitstream and the quantized memory state of the LTP logic.</p>
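    <p>Editorial illustration (not part of the claims): a hedged sketch of the encoding method of Claim 24, using an assumed uniform scalar quantizer and an assumed length-prefixed layout for the embedded signal; neither is specified by the claim.</p>

```python
import struct
import numpy as np

def quantize_ltp_state(ltp_memory, bits=8):
    """Uniform scalar quantizer for the LTP memory state. Illustrative
    only: the claims do not mandate a particular quantizer."""
    peak = float(np.max(np.abs(ltp_memory))) or 1.0
    levels = 2 ** bits - 1
    codes = np.round((ltp_memory / peak + 1.0) * levels / 2.0).astype(np.uint16)
    return codes, peak

def embed_frame(primary_bits: bytes, codes: np.ndarray, peak: float) -> bytes:
    """One possible embedded layout: primary payload first, quantized LTP
    state appended, so a legacy decoder can simply ignore the tail."""
    side_info = struct.pack('<f', peak) + codes.astype('<u2').tobytes()
    return struct.pack('<H', len(primary_bits)) + primary_bits + side_info

codes, peak = quantize_ltp_state(np.linspace(-1.0, 1.0, 143))
frame = embed_frame(b'\x00' * 30, codes, peak)  # 30-byte toy primary payload
```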
    <p>25. A method of decoding a speech encoded bitstream comprising: receiving a speech encoded bitstream from a speech encoder at a first speech decoder (260) and a second decoder (275); receiving at the second decoder quantized values of a memory state of an LTP logic of the speech encoder; and updating an LTP memory element in the second decoder in response thereto.</p>
    <p>26. The method of Claim 25 further comprising receiving the same codebook and gain decisions provided by the speech encoder at the same time.</p>
    <p>27. The method of Claim 25 or Claim 26 wherein the step of updating is performed in response to determining one or more errors in the speech encoded bitstream.</p>
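    <p>Editorial illustration (not part of the claims): the matching decode step for the method of Claims 25 to 27, unpacking the assumed embedded layout from the previous sketch and recovering the quantized LTP state for the second decoder:</p>

```python
import struct
import numpy as np

def split_frame(frame: bytes, bits=8):
    """Inverse of the embed_frame layout sketched after Claim 24: primary
    payload first, then the peak value and the quantized LTP state for
    the second decoder (Claim 25)."""
    (n,) = struct.unpack_from('<H', frame, 0)
    primary = frame[2:2 + n]
    (peak,) = struct.unpack_from('<f', frame, 2 + n)
    codes = np.frombuffer(frame, dtype='<u2', offset=2 + n + 4)
    levels = 2 ** bits - 1
    ltp_state = (codes.astype(float) * 2.0 / levels - 1.0) * peak
    return primary, ltp_state

# A toy frame: 2-byte primary payload, peak = 1.0, two 8-bit codes.
toy = (struct.pack('<H', 2) + b'\xab\xcd' + struct.pack('<f', 1.0)
       + np.array([0, 255], dtype='<u2').tobytes())
primary, ltp_state = split_frame(toy)  # ltp_state is approx. [-1.0, +1.0]
```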
GB0605041A 2006-03-14 2006-03-14 Speech communication unit integrated circuit and method therefor Active GB2436192B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0605041A GB2436192B (en) 2006-03-14 2006-03-14 Speech communication unit integrated circuit and method therefor
PCT/US2007/062277 WO2007106638A2 (en) 2006-03-14 2007-02-16 Speech communication unit integrated circuit and method therefor

Publications (3)

Publication Number Publication Date
GB0605041D0 GB0605041D0 (en) 2006-04-19
GB2436192A true GB2436192A (en) 2007-09-19
GB2436192B GB2436192B (en) 2008-03-05

Family

ID=36241533

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0605041A Active GB2436192B (en) 2006-03-14 2006-03-14 Speech communication unit integrated circuit and method therefor

Country Status (2)

Country Link
GB (1) GB2436192B (en)
WO (1) WO2007106638A2 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6714907B2 (en) * 1998-08-24 2004-03-30 Mindspeed Technologies, Inc. Codebook structure and search for speech coding
US6691082B1 (en) * 1999-08-03 2004-02-10 Lucent Technologies Inc Method and system for sub-band hybrid coding
FI119955B (en) * 2001-06-21 2009-05-15 Nokia Corp Method, encoder and apparatus for speech coding in an analysis-through-synthesis speech encoder
US7668712B2 (en) * 2004-03-31 2010-02-23 Microsoft Corporation Audio encoding and decoding with intra frames and adaptive forward error correction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5682407A (en) * 1995-03-31 1997-10-28 Nec Corporation Voice coder for coding voice signal with code-excited linear prediction coding
AU733156B2 (en) * 1997-03-14 2001-05-10 Nokia Technologies Oy Audio coding method and apparatus
US6721700B1 (en) * 1997-03-14 2004-04-13 Nokia Mobile Phones Limited Audio coding method and apparatus
US20040093208A1 (en) * 1997-03-14 2004-05-13 Lin Yin Audio coding method and apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2474297A (en) * 2009-10-12 2011-04-13 Bitea Ltd Voice quality testing of digital wireless networks in particular tetra networks using identical sound cards
GB2474297B (en) * 2009-10-12 2017-02-01 Bitea Ltd Voice Quality Determination

Also Published As

Publication number Publication date
GB0605041D0 (en) 2006-04-19
WO2007106638A3 (en) 2008-04-24
GB2436192B (en) 2008-03-05
WO2007106638A2 (en) 2007-09-20

Legal Events

Date Code Title Description
732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)

Free format text: REGISTERED BETWEEN 20110127 AND 20110202

732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)

Free format text: REGISTERED BETWEEN 20170831 AND 20170906