WO2001009880A1

WO2001009880A1 - Multimode vselp speech coder

Info

Publication number: WO2001009880A1
Application number: PCT/EP2000/007566
Authority: WO
Inventors: Jonathan Alastair Gibbs; Dominic Chan; Mark A. Jasiuk
Original assignee: Motorola Limited
Priority date: 1999-08-02
Filing date: 2000-08-02
Publication date: 2001-02-08
Also published as: EP1212750A1; GB9917916D0; GB2352949A; AU7272100A

Abstract

Disclosed is a multiple PSI-VSELP (Pitch Synchronised Vector Sum Excited Linear Prediction) speech coder. Exitation vectors are generated from a set of basis vectors. The basis vectors storage is structured in three codebooks. A first codebook is used for coding unvoiced signals. A second codebook for voiced signals having a pitch smaller than a predetermined value, and the third codebook for the remaining pitch values. The basis vectors are phase synchronized with the signal energy profile.

Description

MULTT ODE VSELP SPEECH CODER

Field of the Invention

This invention relates to speech coding techniques. The invention is applicable to, but not limited to, speech codecs and, in particular, to methods of utilising speech codecs in radio communications systems.

Background of the Invention

Many voice communications systems, such as the TErrestrial Trunked RAdio (TETRA) system for private mobile radio users, use speech processing units to encode and decode speech patterns. In such voice communications systems the speech encoder converts the analogue speech pattern into a suitable digital format for transmission and the speech decoder converts a received digital speech signal into an appropriate analog speech pattern.

As spectrum for such voice communications systems is a valuable resource, it is desirable to limit the channel bandwidth used, to maximise the number of users per frequency band. Hence, the primary objective in the use of speech coding techniques is to reduce the occupied capacity of the speech patterns as much as possible, by use of compression techniques, without losing fidelity of speech signals.

Speech coding typically uses speech production modelling techniques to compress pulse code modulation (PCM) speech signals into bit-rates that are suitable for different kinds of bandwidth-limited applications such as speech communication systems or voice storage systems.

The basic speech production model, that is commonly used in speech coding algorithms, uses linear predictive coding (LPC). The LPC filter models the combined effect of the glottal pulse model, the vocal tract and the lip radiation. For voiced speech, the voiced excitation, which consists of a pulse train separated by the pitch duration T, is used as an input signal to the LPC filter. Alternatively, for unvoiced speech, a Gaussian noise source is used as the LPC filter input excitation. Advances in speech coding development led to the introduction of Analysis by Synthesis technique used in CELP (Code Excited Linear Prediction) such as (Algebraic Code Excited Linear Prediction).

The present invention generally relates to digital speech coding at low data (bit) rates, and more particularly, is directed to an improved method for coding excitation information for such code-excited linear predictive (CELP) speech coders.

CELP is a speech coding technique that has the potential of producing high quality synthesised speech at low bit rates, i.e. 4 to 16 kilobits-per-second (kbps). This class of speech coding is used in numerous speech communications and speech synthesis applications.

In the art of speech coding, the term "code-excited" or "vector-excited" is derived from the fact that an excitation sequence for the speech coder is vector quantised, i.e. a single codeword is used to represent a sequence, or vector, of excitation samples. In this way, it is possible to achieve data rates of less than one bit per sample for coding an excitation sequence. Stored excitation code vectors generally consist of independent random white Gaussian sequences. One code vector from a codebook of stored excitation code vectors is used to represent a block of, say N excitation samples. Each stored code vector is represented by a particular codeword, i.e. an address of the code vector memory location. It is this "codeword" that provides the best representation of an input speech signal which is subsequently sent over a communications channel to the speech synthesiser to reconstruct the speech frame at the receiver. A summary of the CELP speech coding technique is described in M.R. Schroeder and B.S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 3, pp 937-40, March 1985.

In advanced CELP speech coders, further aspects of the speech signal are characterised, for example the long term ("pitch") and short term ("formant") predictors which model the characteristics of the input speech signal are incorporated in a set of time- varying linear filters. In such an advanced speech coder, an excitation signal for the filters is chosen from a codebook of stored innovation sequences, or code vectors. For each frame of speech that is coded, the speech coder applies each individual code vector to the filters to generate a reconstructed speech signal, and compares the original input speech signal to the reconstructed signal to create an error signal. The error signal is then weighted by passing it through a weighting filter having a response based on human auditory perception. The optimum excitation signal is determined by selecting the code vector that produces the weighted error signal with the minimum energy of error for the current frame.

The difficulty of the CELP speech coding technique lies in the very high computational complexity required to perform an exhaustive search of all the excitation code vectors in a typical codebook, for example at a sampling rate of 8 kilohertz (KHz), a 5 millisecond (msec) frame of speech would consist of 40 samples. If the excitation information were coded at a rate of 0.25 bits per sample (corresponding to 2 kbps), then 10 bits of information are used to code each 5 msec, frame. Hence, the random codebook would then contain 2¹⁰, or 1024, random code vectors.

A vector search procedure in such a coder requires approximately 15 multiply-accumulate computations (MACs) (assuming a third order long-term predictor and a tenth order short-term predictor) for each of the 40 samples in each code vector. This corresponds to 600 MACs per code vector per 5 msec speech frame, or approximately 120,000,000 MACs per second (600 MACs/5 msec frame x 1024 code vectors). One can now appreciate the extraordinary computational effort required to search the entire codebook of 1024 vectors for the best fit - an unreasonable task for real-time implementation with today's digital signal processing technology.

Moreover, the memory allocation requirement to store the codebook of independent random vectors is also exorbitant. For the above example, a 640 kilobit read-only- memory (ROM) would be required to store all 1024 code vectors, each having 40 samples , each sample represented by a 16-bit word. This ROM size requirement is inconsistent with the size and cost goals of many speech coding applications. Such onerous technical requirements of standard, prior art, code-excited linear prevent the technique from being a practical approach to speech coding.

One approach for reducing the computational complexity of this code vector search process is to implement the search calculations in a transform domain, as described in I.M. Trancoso and B.S. Atal's paper titled, "Efficient Procedures for Finding the Optimum Innovation in Stochastic Coders", Proceedings of ICASSP, Vol. 4, pp 2375-8, April 1986. Using this approach, discrete Fourier transforms (DFT's) or other transforms may be used to express the filter response in the transform domain such that the filter computations are reduced to a single MAC operation per sample per code vector. However, an additional 2 MACs per sample per code vector are also required to evaluate the code vector, thus resulting in a substantial number of multiply-accumulate operations, i.e. 120 per code vector per 5 msec frame, or 24,000,000 MACs per second in the above example. Still further, the transform approach requires at least twice the amount of memory, since the transform of each code vector must also be stored. In the above example, a 1.3 Megabit ROM would be required for implementing CELP using transforms.

A second approach for reducing the computational complexity is to structure the excitation codebook such that the code vectors are no longer independent of each other. In this manner, the filtered version of a code vector can be computed from the filtered version of the previous code vector, again using only a single filter computation MAC per sample.

This approach results in approximately the same computational requirements as transform techniques, i.e. 24,000,000 MACs per second, while significantly reducing the amount of ROM required to 16 kilobits in this example. Examples of this type of codebook are given in the article titled "Speech Coding Using Efficient Pseudo-Stochastic Block Codes", Proceedings of ICASSP, Vol. 3, pp 1354-7, April 1987, by D. Lin. The ROM size is based on 2^M x # bits/word, where M is the number of bits in the codeword such that the codebook contains 2^M code vectors. Therefore, the memory requirements still increase exponentially with the number of bits used to encode the frame of excitation information, for example, the ROM requirements increase to 64 kilobits when using 12 bit codewords.

A third approach for reducing the computational and storage complexity is to structure the excitation codebook such that the code vectors consist of a small number of unit impulses (typically up to 10 per 5 msec, frame). These unit impulses are allowed to have complementary sign (+/-1). Efficient sub-optimal searching of these codebooks is possible when the codebooks are further structured to position the pulses on a series of regularly spaced time-tracks throughout the excitation vector. These are known as Algebraic Codebooks. Such codebooks are described in articles titled "A toll quality 8kb/s speech codec for the personal communications system (PCS)", IEEE Transactions on Vehicular Technology, Vol. 43, pp. 808-816, August 1994 by R. Salami, C. Laflamme, J-P Adoul and D. Massaloux and "GSM Enhanced Full Rate Speech Codec", Proceedings of ICASSP, pp 771-774, April 1997, by K Jarvinen, J. Vainio, P. Kapanen, T. Honkanen, P. Haavisto, R. Salami, C. Laflamme and J-P Adoul.

One advantage of this particular scheme is that the codebook itself does not need to be stored. However, as the bit rate available to represent the excitation codebook decreases, the number of pulses available to be used diminishes quickly. This leads to annoying artefacts in the quality of synthesised unvoiced speech, see for example "Removal of Sparse-Excitation Artefacts in CELP", Proceedings of ICASSP, May 1998, by R. Hagen, Erik Ekudden, B. Johansson and W.B. Kleijn.

A fourth approach for reducing the computational complexity is through the use of Vector Sum Excitation and is described in "Vector Sum Excited Linear Prediction (VSELP) Speech Coding at 8kbps", Proceedings of ICASSP, pp 461-464, April 1990, by L A. Gerson and M. A. Jasiuk. This scheme is shown in FIG. 1 and is an improved prior art codec arrangement, based on vector addition/subtraction to simulate and obtain a more accurate representation of the speech excitation signal.

In FIG. 1, there is shown a general block diagram of code excited linear predictive speech coder 100 utilising the excitation signal generation technique according to the present invention. An acoustic input signal to be analysed is applied to speech coder 100 at microphone 102. The input signal, typically a speech signal, is then applied to filter 104. Filter 104 generally will exhibit band-pass filter characteristics.

The analog speech signal from filter 104 is then converted into a sequence of N pulse samples, and the amplitude of each pulse sample is then represented by a digital code in an analog-to-digital (A/D) converter 108, as generally known in the art. The sampling rate is determined by sample clock SC, which represents an 8.0 KHz rate in the prior art embodiment described in FIG. 1. The sample clock SC is generated along with the frame clock FC via clock 112.

The digital output of A/D 108, which may be represented as input speech vector s(n), is then applied to coefficient analyser 110. This input speech vector s (n) is repetitively obtained in separate frames, i.e. blocks of time, the length of which is determined by the frame clock FC. In the preferred embodiment, input speech vector s (n), where 1 <= n <= N, represents a 4 msec frame containing N = 40 samples. Each sample is represented by 12 to 16 bits of a digital code. For each block of speech, a set of linear predictive coding (LPC) parameters are produced in accordance with prior art techniques by coefficient analyser 110.

The short term predictor parameters (STP), long term predictor parameters LTP, weighting filter parameters (WFP), and excitation gain factor γ (along with the best excitation codeword I, as described later) are applied to multiplexer 150 and sent over the channel for use by the receiving speech synthesiser. The article titled "Predictive Coding of Speech at Low Bit Rates," IEEE Trans. Commun., Vol. COM-30, pp. 600-14, April 1982, by B.S. Atal, provides an indication of representative methods for generating these parameters. The input speech vector s(n) is also applied to subtractor 130, the function of which will be described later.

Basis vector storage block 114 contains a set of M basis vectors V_m(n), where 1 < = m <= M, each basis vector comprised of N samples, wherein 1= n = N. These basis vectors are used by codebook generator 120 to generate a set of 2^M-1. Each of the M basis vectors are comprised of a series of random white Gaussian samples.

Codebook generator 120 utilises the M basis vectors v_m(n) and a set of 2^M excitation codewords I, where 0 = i = 2^M-1, to generate the 2^M excitation vectors u, (n). In the prior art embodiment, each codeword L=i. If the excitation signal were coded at a rate of 0.25 bits per sample, for each of the 40 samples (such that M=10), then there would be 10 basis vectors used to generate the 1024 excitation vectors.

For each individual excitation vector u, (n), a reconstructed speech vector s' , (n) is generated to be compared with the input speech vector s (n). Gain block 122 scales the excitation gain factor (γ) may be pre-computed by coefficient analyser 110 and used to analyse all excitation vectors as shown in FIG. 1, or may be optimised jointly with the search for the best excitation codeword I and generated by codebook search controller 140.

The scaled excitation signal vu, (n) is then filtered by long term predictor filter 124 and short term predictor filter 126 to generate the reconstructed speech vector s' , (n). Filter 124 utilises the long term predictor parameters LTP to introduce voice periodicity, and filter 126 utilises the short term predictor parameters STP to introduce the spectral envelope. Note that blocks 124 and 126 are actually recursive filters that contain the long term predictor and short term predictor in their respective feedback paths.

The reconstructed speech vector s' , (n) for the i-th excitation code vector is compared to the same block of the input speech vector s(n) by subtracting these two signals in subtractor 130. The difference vector e, (n) represents the difference between the original and the reconstructed blocks of speech. The difference vector is perceptually weighted by weighting filter 132, utilising the weighting filter parameters WTP that are generated by coefficient analyser 110. The preceding reference details a representative weighting filter transfer function. Perceptual weighting is a technique which accentuates those frequencies where the error is perceptually more important to the human ear, and attenuates other frequencies.

Energy calculator 134 computes the energy of the weighted difference vector e', (n) , and applies this error signal E, to the codebook search controller 140. The codebook search controller 140 compares the i-th error signal for the present excitation vector u, (n) against previous error signals to determine the excitation vector producing the minimum error. The code of the i-th excitation vector having a minimum error is then output over the channel as the best excitation code I. In the alternative, the codebook search controller 140 may determine a particular codeword that provides an error signal having some predetermined criteria, such as meeting a predefined error threshold.

In each speech sub-frame, such a VSELP speech codec, attempts to match the ideal excitation of an all-pole model of the vocal tract by summing a small number of basis vectors (or their negatives) on a sample-by-sample basis to the most appropriate part of the excitation from the previous sub-frame. Scaling of both components is calculated to minimise the weighted mean squared error (MSE) between the synthesised speech and the input speech signals. The VSELP vectors are all of a length equal to that of one speech sub-frame. Hence, whether speech is voiced (repetitive pattern to the excitation) or unvoiced (essentially random excitation) and, if voiced, irrespective of the phase of the input speech signal relative to the sub-frame boundary, the VSELP basis vectors must be constructed to minimise the weighted MSE.

There are advantages in speech quality if the excitation vector is chosen depending upon whether the speech to be synthesised is voiced or unvoiced, NTT has produced speech codecs, including the JDC/PDC Half-Rate and a submission to the ITU-T SGI 6 G.4kb/s codec selection, which make use of a Pitch Synchronised CELP codebook. This is described in the paper titled "Design of a Toll-Quality 4 kbit/s Speech Coder Based on Phase- Adaptive PSI-CELP", Proc. ICASSP, pp. 755-758, April 1997, by K. Mano. However, apart from being pitch synchronised there is no structure to this codebook and it follows the original CELP codec paradigm.

, With any codec arrangement, there is always some loss in speech quality when trying to encode and decode speech signals. The aim of any codec designer is to minimise the perceived differences between the real speech and the encoded/decoded speech to the user. Hence, the codec designer is always trying to improve the codec's robustness to channel errors in the codebook index, and by providing an efficient codebook searching algorithm to minimise the inherent delay incorporated by searching codebooks. A number of CELP codecs search all of the samples held in the codebook, for example 1024 separate samples, which is very inefficient.

As clarified by the aforementioned background of existing CELP -based speech coding techniques, and their inherent problems, a need exists to provide an improved speech coding technique and method of operation.

Summary of the Invention

According to the to a first aspect of the invention there is provided a speech coder for a speech communications unit. The speech coder includes analysis means for analysing an incoming speech signal to determine a particular characteristic of the speech signal and basis vector generating means, operably coupled to the analysis means, for generating a series of basis vectors based on the determined characteristic.

In this manner, the series of basis vectors in the speech coder are optimally generated in accordance with a received speech signal. Preferably the speech signal is a voiced speech signal - voiced speech being characterised as those portions of the received signal waveform that are highly periodic with period equal to the pitch as compared to an unvoiced speech signal that can be construed as a a random waveform. In the preferred embodiment of the invention, the determined characteristic is the pitch of the incoming speech signal, such that the series of basis vectors are generated according to a series of said pitch determinations, such that at least one set of basis vectors is used to model speech of different pitch period. Preferably the speech coder further includes selecting means, operably coupled to the analysis means for selecting a portion of the series of basis vectors according to the phase of the incoming speech signal. The conventional VSELP basis vector set are generated using a phase synchroniser operably coupled to the analysis means and the basis vector generator means for synchronising the basis vector phase.

The speech coder in the preferred embodiment of the invention, further includes a long- term predictor operably coupled to the phase synchroniser, for matching an energy profile of the present speech signal with a speech characteristic in the long term predictor, to generate the series of basis vectors based on the determined phase characteristic.

In an alternative embodiment of the first aspect of the present invention the speech coder includes use of a number of codebooks, such that a first codebook having a first length is used to model unvoiced speech. A second codebook of a second length is used to model voiced speech of pitch periods upto a predefined threshold. A third codebook of a third length can then be used to model voiced speech of pitch periods above a predefined threshold. The inventors have found, as a result of detailed investigation, analysis and inventive thinking, that the coupling of the basis vector codebook to the phase and pitch period of the incoming speech exploits the observation that adjustments to the long-term predictor memory require different magnitude - dependent upon the position within the pitch cycle. Larger magnitude adjustments are required near to the glottal pulse and smaller adjustments are required away from this epoch.

In a second aspect of the present invention a radio communications unit including the speech codec, as hereinbefore described, is provided.

In a third aspect of the present invention a method of generating basis vectors in a speech coder, is provided. The method includes the step of examining an energy profile of an excitation sequence of a first portion of a speech signal prior to analysing a second portion of the speech signal.

Preferably, the method further includes the step of determining a particular characteristic of the speech signal, and generating a series of basis vectors based on the determined characteristic, for example where the determined characteristic is a pitch of the incoming speech signal, such that the series of basis vectors are generated according to a series of said pitch determinations. In the preferred embodiment, portions of the incoming speech signal are selected, and the series of basis vectors are generated according to the determined characteristic of said selected portion.

Preferably, the step of matching an energy profile of the present speech signal with a speech characteristic is performed in a long term predictor to generate the series of basis vectors based on the determined phase characteristic, where at least one set of basis vectors can be used to model speech of different pitch periods.

A preferred embodiment of the invention will now be described, by way of example only, with reference to the drawings.

Brief Description of the Drawings

FIG. 1 shows a prior art block diagram of a VSELP codec arrangement. FIG. 2 shows a block diagram of a VSELP codec excitation generation arrangement in accordance with a preferred embodiment of the invention.

FIG. 3 shows a block diagram of a pitch-synchronous VSELP codec arrangement for a Primary Excitation Source according to a preferred embodiment of the invention.

Detailed Description of the Drawings

Referring to FIG. 2, a speech codec arrangement is shown, based on vector addition/subtraction to simulate the speech signal, in accordance with a preferred embodiment of the invention. There is shown a general block diagram of a code excited linear predictive speech coder 200 utilising the excitation signal generation technique. An acoustic input signal to be analysed is applied to speech coder 200 at microphone 202. The input signal, typically a speech signal, is then applied to filter 204. Filter 204 will generally exhibit band-pass filter characteristics. However, if the speech bandwidth is adequate, filter 204 may be a direct wire connection.

The analog speech signal from filter 204 is then converted into a sequence of N pulse samples, and the amplitude of each pulse sample is then represented by a digital code in analog-to-digital (A/D) converter 208, as is known in the art. The sampling rate is determined by sample clock SC, which represents an 8.0 KHz rate in the preferred embodiment. The sample clock SC is generated along with the frame clock FC via clock 212.

In the preferred embodiment of the invention, Super Basis vector storage block 214 is now coupled to the LTP 224 and contains several sets of M super basis vectors V_cm(n'), wherein 1 = m = M. Each set of super basis vectors is used to model speech of different pitch period which can be derived from the LTP parameters, as compared to prior art basis vectors which are arranged to be of a length equal to a speech sub-frame. One codebook of length N samples is used to model unvoiced speech, a second codebook of length N| is used to model voiced speech of pitch periods upto Nj and codebooks N₂ N₃ ... etc. are used to model speech of pitch periods from Ni to N₂ and so on. Therefore the value of n' will differ depending upon the pitch range of the VSELP super basis vector set.

From these sets of super basis vectors, a conventional VSELP basis vector set, each of length N samples, is derived. If the speech to be synthesised is voiced, as determined by the voiced/unvoiced block 223, then the basis vectors are synchronised with the energy profile of the component of the LTP memory that is applicable to the current speech frame under consideration. This is achieved by Phase Synchronizer 228 and stored in temporary basis vector storage 221. These temporary basis vectors are used by codebook generator 220 to generate a set of 2^M-1 possible sequences as in the VSELP technique.

Codebook generator 220 utilises the M basis vectors v_m(n) and a set of 2^M excitation codewords I, where 0 = i = 2^M-1, to generate the 2^M excitation vectors u, (n). In the present embodiment, each codeword I,=i. If the excitation signal were coded at a rate of 0.25 bits per sample for each of the 40 samples (such that M=10), then there would be 10 basis vectors used to generate the 1024 excitation vectors.

For each individual excitation vector u, (n), a reconstructed speech vector s' , (n) is generated for comparison to the input speech vector s (n). Gain block 222 scales the excitation gain factor γ which may be pre-computed by coefficient analyser 210 and used to analyse all excitation vectors as shown in FIG. 2, or may be optimised jointly with the search for the best excitation codeword I and generated by codebook search controller 240. The scaled excitation signal vu, (n) is then filtered by long term predictor filter 224 and short term predictor filter 226 to generate the reconstructed speech vector s' , (n). Filter 224 utilises the long-term predictor parameters LTP to introduce voice periodicity, and filter 226 utilises the short-term predictor parameters STP to introduce the spectral envelope. Note that blocks 224 and 226 are actually recursive filters that contain the long term predictor and short term predictor in their respective feedback paths.

\ The reconstructed speech vector s' , (n) for the i-th excitation code vector is compared to the same block of the input speech vector s(n) by subtracting these two signals in subtractor 230. The difference vector e, (n) represents the difference between the original and the reconstructed blocks of speech. The difference vector is perceptually weighted by weighting filter 232, utilising the weighting filter parameters WFP generated by coefficient analyser 210. For further details on the WFP, refer to the preceding reference for a representative weighting filter transfer function. Such perceptual weighting, as previously indicated, accentuates those frequencies where the error is perceptually more important to the human ear, and attenuates other frequencies.

Energy calculator 234 computes the energy of the weighted difference vector e', (n) , and applies this error signal E, to codebook search controller 240. The search controller compares the i-th error signal for the present excitation vector u, (n) against previous error signals to determine the excitation vector producing the minimum error. The code of the i-th excitation vector having a minimum error is then output over the channel as the best excitation code I. In the alternative, codebook search controller 240 may determine a particular codeword that provides an error signal having some predetermined criteria, such as meeting a predefined error threshold.

In each speech sub-frame, a VSELP speech codec, attempts to match the ideal excitation of an all-pole model of the vocal tract by summing a small number of basis vectors (or their negatives) on a sample-by-sample basis to the most appropriate part of the excitation from the previous sub-frame. Scaling of both components is calculated to minimise the weighted mean squared error (MSE) between the synthesised speech and the input speech signals.

The method to achieve synchronisation of the pitch synchronous codebook, according to the preferred embodiment of the invention, is described with reference to the method of FIG. 3. Referring to FIG 3, a pitch-synchronous VSELP codebook is synchronised by examining the energy profile of the combined excitation sequence comprising the long-term predictor component for the current sub-frame and the previously stored combined codebook and long-term predictor excitation. In FIG. 3 Super basis vectors 314, 316, 318 represent an optimised set of excitation parameters for pitches commensurate with that for the current speech being processed. From this set of super basis vectors a set of conventional VSELP basis vectors 324, 326 and 328 are derived by phase adjustment and repetition, if necessary, by function 320.

The phase adjustment is performed by analysis of a combination of the stored previous combined excitation 350 and the scaled long-term predictor component 346. The scaled long term predictor component is derived by repeating the stored previous combined excitation 350 with a delay equal to the pitch period for the speech currently being analysed to derive the LTP excitation vector 352. This is scaled in multiplier 354 by the gain Y_LTP in order to minimise the weighted squared synthesis error from the input speech segment. The phase adjustment is made by analysing the most recent samples over one pitch period of the combined excitations 346, 350, labelled as segment 348, in function 356 to locate the position of the maximum energy peak.

The conventional VSELP basis vectors are multiplied by all possible combinations of "+/-1" in multipliers 330, 332 and 334. They are then summed in summer 336 in order to find the best combination of "+l"s & "-l"s which minimises the weighted squared synthesis error to derive the VSELP excitation vector for the current speech sub-frame 338. The VSELP excitation vector is multiplied in multiplier 340 by the gain Y_VSELP and summed in adder 342 with the scaled long term predictor excitation vector to derive the combined excitation vector 344.

As an alternative to the method illustrated in FIG 3 it is possible to derive the phase for the conventional VSELP basis vectors by analysis of the previous combined excitation 350 without the long-term predictor component. This embodiment is useful when for some reason the long-term predictor component is not available or unreliable. The derivation of a set of conventional VSELP basis vectors from the super basis vectors is conducted in the same manner as previously described. The benefits provided by the present invention, in particular with regard to providing a pitch-synchronised VSELP codebook, lie in the better modelling of the stochastic excitation necessary to update the long-term predictor state for voiced speech. In a conventional VSELP speech coder, the VSELP basis vectors may be optimised to derive updates to the long-term predictor state or adaptive codebook. However, this must be a compromise over all speech types; voiced and unvoiced, mixed gender, differing pitch periods and all pitch phases etc. Pitch synchronised VSELP codebooks may also be optimised, but since different pitch periods and pitch phases are distinguished, the codebook may be optimised to provide better overall performance. This optimisation may be used to reduce the necessary bit-rate of the speech codec, reduce the codebook complexity, or improve synthesised speech quality.

In this manner, the current invention provides higher quality speech, for a given bit rate, than the conventional VSELP codebook paradigm. Furthermore, the present invention provides a more representative speech signal, by determining a characteristic of the speech signal, say pitch of the incoming signal, to generate an improved 'super' set of basis vectors. In addition, since the codebooks are of a similar size to conventional VSELP codebooks, and the selection of the codebook is performed by simple processes, the additional complexity is relatively small.

Hence, an improved speech coding technique and method of operation has been provided for minimal additional computational complexity and memory requirements associated with exhaustive codebook searching.

Claims

1. A speech coder for a speech communications unit comprising analysis means for analysing an incoming speech signal to determine a particular characteristic of the speech signal; and basis vector generating means, operably coupled to the analysis means, for generating a series of basis vectors based on the determined characteristic.

2. A speech coder according to claim 1, wherein the speech signal is a voiced speech signal.

3. A speech coder according to claim 1 or 2, wherein the determined characteristic is a pitch of the incoming speech signal, such that the series of basis vectors are generated according to a series of said pitch determinations.

4. A speech coder according to any of the preceding claims further comprising selecting means operably coupled to the analysis means for selecting a portion of the incoming speech signal and generating the series of basis vectors according to the determined characteristic of said portion.

5. A speech coder according to claim 4, further comprising a phase synchroniser operably coupled to the analysis means and the basis vector generator means for synchronising a phase of a present speech signal to previous or perceived future speech signals.

6. A speech coder according to claim 5, further comprising a long-term predictor operably coupled to the phase synchroniser for matching an energy profile of the present speech signal with a speech characteristic in the long term predictor to generate the series of basis vectors based on the determined phase characteristic.

7. A speech coder according to any one of claims 4 to 6, wherein sets of basis vectors are generated and stored in sub-codebooks according to the determined characteristic of the selected portions of the incoming speech signal.

8. A speech coder according to any of the preceding claims wherein at least one set of basis vectors is used to model speech of different pitch period.

9. A speech coder in accordance with claim 8 when dependent upon claims 6 or 7, wherein the pitch period is derived from the long-term predictor.

10. A speech coder according to any of the preceding claims wherein a first codebook having a first length is used to model unvoiced speech and a second codebook of a second length is used to model voiced speech of pitch periods upto a predefined threshold.

11. A speech coder according to claim 10, further comprising a third codebook of a third length used to model voiced speech of pitch periods above a predefined threshold.

12. A speech coder according to claim 6 further comprising a short-term predictor, wherein the long-term predictor introduces voice periodicity into the reconstructed speech and the short term predictor introduces a spectral envelope to the reconstructed speech.

13. A radio communications unit having a speech coder according to any of the preceding claims.

14. A method of generating a set of basis vectors in a speech coder, the method comprising the step of examining an energy profile of an excitation sequence of a first portion of a speech signal prior to analysing a second portion of the speech signal.

15. A method of generating a set of basis vectors according to claim 14, the method further comprising the step of extracting a portion from an appropriate set of basis vectors.

16. A method of generating basis vectors according to claim 14, the method further comprising the step of determining a particular characteristic of the speech signal; and generating a series of basis vectors based on the determined characteristic.

17. A method of generating basis vectors in a speech coder according to claim 16, wherein the determined characteristic is a pitch of the incoming speech signal, such that the series of basis vectors are generated according to a series of said pitch determinations.

18. A method of generating basis vectors in a speech coder according to claim 14-to 16 further comprising the step of selecting a portion of the incoming speech signal and generating the series of basis vectors according to the determined characteristic of said selected portion.

19. A method of generating basis vectors in a speech coder according to any one of claims 14 to 18, further comprising the step of synchronising a phase of a present speech signal to previous or perceived future speech signals.

20. A method of generating basis vectors in a speech coder according to claim 19, further comprising the step of matching an energy profile of the present speech signal with a speech characteristic in a long term predictor to generate the series of basis vectors based on the determined phase characteristic.

21. A method of generating basis vectors in a speech coder according to any one of claims 14 to 20, further comprising the step of storing the generated sets of basis vectors in sub-codebooks according to the determined characteristic of the selected portions of the incoming speech signal.

22. A method of generating basis vectors in a speech coder according to any one of claims 14 to 21, further comprising the step of using at least one set of basis vectors to model speech of different pitch period.

23. A method of generating basis vectors in a speech coder according to any one of claims 14 to 22, further comprising the step of deriving the pitch period from a long term predictor.

24. A method of generating basis vectors in a speech coder according to any one of claims 14 to 23, further comprising the steps of using a first codebook having a first length to model unvoiced speech and using a second codebook of a second length to model voiced speech of pitch periods upto a predefined threshold.

25. A method of generating basis vectors in a speech coder according to claim 23, further comprising the step of using a third codebook of a third length to model voiced speech of pitch periods above the predefined threshold.

26. A method of generating basis vectors in a speech coder according to claim 23, further comprising the step introducing voice periodicity into reconstructed speech using the long term predictor and introducing a spectral envelope to the reconstructed speech using a short term predictor.

27. A method of generating basis vectors in a speech coder, the method comprising the step of examining an energy profile of an excitation sequence of a first portion of a speech signal prior to analysing a second portion of the speech signal. λ

28. A speech coder substantially as hereinbefore described with reference to, or as illustrated by, FIG. 2 of the drawings.

29. A method of generating basis vectors in a speech coder substantially as hereinbefore described with reference to, or as illustrated by, FIG. 3 or FIG. 4 of the drawings.