EP0619574A1 - Speech coder employing analysis-by-synthesis techniques with a pulse excitation - Google Patents

Speech coder employing analysis-by-synthesis techniques with a pulse excitation Download PDF

Info

Publication number
EP0619574A1
EP0619574A1
Authority
EP
European Patent Office
Prior art keywords
signal
samples
filtering
long
excitation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP94105438A
Other languages
German (de)
French (fr)
Inventor
Luca Cellario
Danielle Sereno
Willem Bastiaan Kleijn
Peter Kroon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SIP SAS
Telecom Italia SpA
AT&T Corp
Original Assignee
SIP SAS
SIP Societa Italiana per l'Esercizio delle Telecomunicazioni SpA
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SIP SAS, SIP Societa Italiana per l'Esercizio delle Telecomunicazioni SpA and AT&T Corp
Publication of EP0619574A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/12 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L2019/0001 Codebooks
    • G10L2019/0003 Backward prediction of gain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L2019/0001 Codebooks
    • G10L2019/0011 Long term prediction filters, i.e. pitch estimation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L2019/0001 Codebooks
    • G10L2019/0012 Smoothing of parameters of the decoder interpolation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L2019/0001 Codebooks
    • G10L2019/0013 Codebook search algorithms
    • G10L2019/0014 Selection criteria for distances
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • The operations performed by LTA are described in detail in the appendix, which includes a program listing in C language. Given the listing, a skilled technician will have no difficulty in designing devices that perform the described functions.
  • Indexes j(d), j(b) are reconverted into quantized or reconstructed values of the respective parameters by reconstructing circuits LTR1, composed of simple read-only memories addressed by the indexes.
  • LTR1 provides the actual values of d, b if j(d) shows a value allowed for the delay (that is, if j(d) is in the range 1 to 101). If j(d) shows any one of the values outside the allowed range (therefore its value is from 102 to 127), LTR1 provides value 0 for b and value d min for d.
  • The excitation signal is composed of shape information (the innovation), represented by one of the words s(n) of an innovation codebook IC1, of a positive or null amplitude parameter g (innovation gain), chosen in a codebook of innovation gains IG1, and of sign information, represented by an innovation-sign parameter whose value is ±1.
  • Even if, to facilitate understanding, codebooks IC1, IG1 are represented as circuit blocks (which could suggest memories containing them), the particular structure of the innovation codebook makes their storage superfluous, as said above. The structure of the innovation and gain codebooks will be examined later.
  • Symbols d0, b0 denote the values related to the current frame, d(-1), b(-1) those related to the previous frame. The interpolation is therefore a linear one and extends over a whole frame, so that the values of d(n) and b(n) vary sample by sample.
  • As regards d(n), it will generally not be an integer number: this means that the value of signal ss(n) at the continuous time instant n-d(n) does not coincide with that of an actually available sample and must be estimated. According to the invention, the estimation is performed through a second-order polynomial interpolation (that is, through a parabola) centered about the discrete time instant nearest to n-d(n); the value thus obtained is then multiplied by the interpolated value b(n).
  • The interpolation procedure adopted has a much lower computational complexity than more sophisticated interpolation methods based on signal filtering; moreover, its effect is essentially a low-pass one, which is useful for the good operation of the coder since it prevents the reconstructed signal from having too marked periodicity properties.
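  • A minimal illustrative sketch, in C (not the appendix listing mentioned above), of the long-term synthesis with per-sample interpolation of d and b and parabolic interpolation of the past signal; the interpolation endpoints, the buffer handling and the function name are assumptions of this sketch, not details given in the text.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Long-term synthesis with per-sample linear interpolation of delay and
     * gain and second-order (parabolic) interpolation of the past signal at
     * the fractional lag n - d(n).  Buffer/boundary handling is simplified:
     * ss[] must be addressable back to the previous pitch periods, and
     * excitation[] is assumed to hold the already gain-scaled innovation. */
    #include <math.h>

    #define LF 160

    void long_term_synthesis(double *ss, const double *excitation,
                             double d_prev, double d_cur,
                             double b_prev, double b_cur)
    {
        for (int n = 0; n < LF; n++) {
            double t  = (double)n / LF;                /* linear interpolation weight   */
            double dn = d_prev + t * (d_cur - d_prev); /* d(n)                          */
            double bn = b_prev + t * (b_cur - b_prev); /* b(n)                          */

            double pos = (double)n - dn;               /* generally non-integer instant */
            int    m   = (int)floor(pos + 0.5);        /* nearest available sample      */
            double x   = pos - m;                      /* fractional offset, |x| <= 0.5 */

            /* second-order (parabolic) interpolation centred on sample m */
            double ym1 = ss[m - 1], y0 = ss[m], yp1 = ss[m + 1];
            double pred = y0 + 0.5 * x * (yp1 - ym1)
                             + 0.5 * x * x * (yp1 - 2.0 * y0 + ym1);

            ss[n] = excitation[n] + bn * pred;
        }
    }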
  • The reconstructed short-term residual ss(n) is supplied to the short-term synthesis filter STS1, whose transfer function is 1/[1-A(z)].
  • The reconstructed and weighted signal yw(n) is subtracted in an adder SM from the modified, reconstructed and weighted signal xw(n), obtained by filtering the output signal of TS in the cascade of two filters STS', SW', respectively identical to STS1 and SW.
  • At the output of SM a weighted error signal e(n) is obtained, which is supplied to the error energy minimizing circuit EM that performs all the operations necessary to determine the optimal shift and excitation.
  • The purpose of circuits TS is to align in time the signal to be coded with the replica that the long-term synthesis filter is able to produce, and in particular to avoid shifts between the pitch peaks in the signal predicted by LTS1 and those in the original signal. To this end, at each subframe TS shifts the time window of Ls samples that locates the subframe itself by a certain amount Δh.
  • The shift to be applied is determined by unit EM with a fast search procedure within a range of values defined by a maximum allowable shift. The shift is applied to the residual signal and not to the original one because the resulting distortion is smoothed by the subsequent filtering in STS', SW' and is therefore substantially imperceptible.
  • The shift applied in a subframe is algebraically added to the shift accumulated up to that time, providing a global shift ĥ; the variation allowed at each subframe is limited in order to avoid too sudden changes. The global shift also cannot exceed a certain maximum value (H samples of the original signal); the reason why H samples of the following frame have also been loaded in MT is therefore evident. The purpose of the limitation on the shift variation is to avoid excessive distortion; the limitation on the global shift is instead determined by the delay that can be tolerated in the coding procedure and therefore by the availability of future samples.
  • The time shift has a resolution finer than one sampling period of the original signal, and it is therefore necessary to upsample the residual signal.
  • To this end, circuit TS includes an upsampling circuit US (in practice an interpolating filter), which supplies at its output the upsampled residual, and a shifting element SH, which receives from EM the information about the shift amount and generates the modified upsampled residual. In the example, the upsampling ratio Γ is 8, and therefore the upsampled signal has a frequency of 64 kHz: this ratio provides a suitable resolution for all desired purposes.
  • Element SH is in practice a memory that loads, at each subframe, the Γ·Ls samples of the upsampled residual plus a certain number of following and previous samples linked to the maximum shift allowed in a frame (in practice, a number of samples equal to twice the maximum shift, as will be explained in the description of the optimal shift search); SH is addressed for reading by the error energy minimizing unit EM, so as to supply the following circuits with Ls samples adequately shifted with respect to the incoming subframe.
  • The innovation codebook includes a certain number of words, each having Ls samples, of which only a very limited number are different from 0. This choice derives from the fact that, the codebook being quite limited, it would be illusory to expect to find in it words with many pulses (that is, non-null samples) in which all pulses are actually suitable; moreover, it reduces the amount of computation necessary when searching for the optimal excitation.
  • The codebook is composed of two parts. The first one includes Ls words having a single non-null sample, with amplitude equal to 1 and positive sign, and Ls-1 null samples. The second part includes words with two samples whose amplitude is 1 and Ls-2 null samples. These words are generated starting from a limited number of key-words (in particular 3) with the method described in European Patent Application EP-A-0396121 in the name of CSELT. In the example considered, all three key-words have the first pulse in position 0 and the second pulse in a respective key position n2(1), n2(2), n2(3); the other words are obtained by shifting the pulse pair towards the word end until the second pulse reaches that end or the first pulse reaches the respective key position. The key positions are chosen so as to give rise to Ni2 (in particular 21) possible positions of the pulse pair; for each of these positions there are two words, differing from each other in the sign of the second pulse, as described in said European Application, which brings the total number of words in the innovation codebook to Ls+2·Ni2 (62 in the example).
  • This innovation codebook structure, with few non-null samples and with words obtained by shifting the samples one position at a time starting from a limited number of keys, is a simple deterministic structure that enables a fast search for the optimal excitation, requiring neither codebook storage nor the actual filtering of the candidate excitation signals.
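  • The deterministic structure can be illustrated with the following sketch (C code, not the appendix listing), which generates the pulse patterns; the key positions n2(k) used here are placeholder values chosen only so that Ni2 = 21, since the actual key positions are not given in this text, and the sign of the second pulse is handled separately at search time.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Generates the deterministic innovation codebook described above:
     * Ls single-pulse words plus two-pulse words obtained by shifting a pulse
     * pair derived from three key words.  The key positions N2[] are assumed. */
    #include <stdio.h>
    #include <string.h>

    #define LS   20
    #define NKEY 3

    static const int N2[NKEY] = { 5, 10, 15 };   /* assumed key positions n2(k) */

    int main(void)
    {
        int word = 0;
        int cb[64][LS];                           /* pulse amplitude patterns    */

        /* first part: single pulse of amplitude +1 in each position */
        for (int p = 0; p < LS; p++) {
            memset(cb[word], 0, sizeof cb[word]);
            cb[word][p] = 1;
            word++;
        }
        /* second part: pulse pairs shifted one position at a time from each key */
        for (int k = 0; k < NKEY; k++) {
            for (int s = 0; N2[k] + s <= LS - 1 && s <= N2[k]; s++) {
                memset(cb[word], 0, sizeof cb[word]);
                cb[word][s] = 1;                  /* first pulse                   */
                cb[word][N2[k] + s] = 1;          /* second pulse (sign applied    */
                word++;                           /* separately at search time)    */
            }
        }
        printf("single-pulse words: %d, pulse-pair positions Ni2: %d, total words: %d\n",
               LS, word - LS, LS + 2 * (word - LS));   /* 20, 21, 62 with these keys */
        return 0;
    }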
  • The test with the words of the first part of the codebook is carried out only if long-term analysis has indicated a voiced sound or when, on the contrary, strong energy concentrations are noted in short signal sections. Such strong concentrations can in fact signal the onset of a voiced section that cannot yet be classified as such, since classification is based on long-term analysis and the previous signal sections showed no features useful to indicate such an onset. Under these conditions, therefore, filter LTS1 would not be able to supply a correct predicted signal.
  • The words in the codebook are identified by a respective index j(s); the index related to the optimal word, adequately coded, is transmitted to the decoder through a connection 2d. Since in the described example the codebook includes 62 words, to which as many indexes j(s) correspond, two further values of j(s) are available, without modifying the number of bits used to code j(s), that do not correspond to any word in the codebook.
  • As regards gain g, this is quantized using a codebook built so as to save coding bits with respect to what would actually be necessary to represent all the values provided in the codebook. Information about the gain is represented, for each subframe, by two indexes j(gmax), j(gnor), the first of which is linked to the maximum value of g in the frame and the second to the difference between such maximum value and the actual value, and by the innovation sign. This information is transmitted to the decoder through a connection 2e.
  • The optimal value of g, determined with the error minimizing procedure that will be described later, is quantized, generating a respective index j(g) that is not transmitted but is reconstructed in the decoder. The value j(gmax) related to the maximum frame gain is identified and transmitted as such if it is not less than Nin; otherwise, index j(gmax) is forced to value Nin. The actual value of index j(gnor) is transmitted only if it is not greater than Nin-1; otherwise the gain is deemed 0 (that is, the innovation is silenced for subframes where the gain is very small with respect to the maximum one) and the index j(s) of the innovation word is forced to one of the values that do not correspond to any codebook word, to signal the transmission of a word with null gain.
  • The gain codebook can be a logarithmic codebook, so that the ratio between two consecutive values is a constant. The choice of this ratio must take several requirements into account; in practice, the ratio between two consecutive gain levels can range from 3 to 6 dB.
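  • A minimal sketch of a logarithmic gain codebook and of the j(gmax)/j(gnor) scheme described above (illustrative C code, not the appendix listing); the step between levels, the codebook size, the minimum gain and the limit Nin are placeholder values, and the difference is expressed here as a difference of indexes, which for a logarithmic codebook corresponds to a ratio of gain values.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Logarithmic innovation-gain codebook (constant ratio between consecutive
     * levels) and the j(gmax)/j(gnor) scheme sketched above.  STEP_DB, NG,
     * G_MIN and NIN are placeholder values, not taken from the patent. */
    #include <math.h>

    #define NG      32         /* number of gain levels (assumed)            */
    #define STEP_DB 4.0        /* ratio between consecutive levels (assumed) */
    #define NIN     8          /* limit used for j(gmax)/j(gnor) (assumed)   */
    #define G_MIN   0.1        /* smallest quantized gain (assumed)          */

    static double level(int j) { return G_MIN * pow(10.0, (j * STEP_DB) / 20.0); }

    int quantize_gain(double g)            /* nearest level index j(g) */
    {
        int best = 0; double err = 1e300;
        for (int j = 0; j < NG; j++) {
            double e = fabs(g - level(j));
            if (e < err) { err = e; best = j; }
        }
        return best;
    }

    /* Per subframe: derive the transmitted pair from j(g) and the frame maximum. */
    void encode_subframe_gain(int jg, int jgmax_frame,
                              int *jgmax, int *jgnor, int *silence)
    {
        *jgmax = (jgmax_frame >= NIN) ? jgmax_frame : NIN;  /* forced to Nin if smaller */
        *jgnor = *jgmax - jg;                               /* difference from maximum  */
        *silence = (*jgnor > NIN - 1);                      /* very small gain: the
                                                               innovation is silenced and
                                                               j(s) is set to a spare value */
    }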
  • Each of the filters has been divided into an element with null input (LTSa, STW1a, STW2a), which provides the contribution of the initial conditions (that is, of the filtering memories of the previous subframes), and an element (STW1b, STW2b) that is reset at each subframe (filtering with null initial conditions), as indicated by a signal R supplied by a time base, not shown. The only filtering of the excitation with null initial conditions is the short-term one, since it has been assumed that delay d is not less than a subframe.
  • The optimal shift determination is composed of three steps.
  • The second step determines the lower and upper extremes hmin, hmax of a range that extends around the shift value ĥ accumulated so far in the frame. Values hmax, hmin are initially fixed so that the differences hmax-ĥ and ĥ-hmin have a prearranged value (for example 20 samples of the upsampled signal); there exists therefore a maximum number of possible values (41 in the example) among which the optimal shift can be searched for.
  • The optimal shift value within the test range is the one minimizing the energy of an error signal e1(n), given by the difference between the modified, reconstructed and weighted signal xw(n) (Fig. 1) and the contribution yw1(n) of the excitation filtering memories; it is obtained with a fast search procedure that reduces the amount of computation required.
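  • A high-level illustrative sketch (C code, not the appendix listing) of the range determination and of the exhaustive selection of the optimal shift; error_energy is a hypothetical helper standing in for the fast evaluation procedure described in the following points, and the clamping against the global shift limit is an assumption based on the limits described earlier.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Candidate-shift range around the accumulated shift and exhaustive
     * selection of the shift minimizing the error energy.  error_energy()
     * is a hypothetical helper; the GAMMA*H clamping is an assumption. */
    #define GAMMA    8          /* upsampling ratio                            */
    #define H        24         /* maximum global shift, original-rate samples */
    #define DELTA_H  20         /* half-range of the search, upsampled samples */

    int search_optimal_shift(int h_acc, double (*error_energy)(int shift))
    {
        int h_min = h_acc - DELTA_H;
        int h_max = h_acc + DELTA_H;

        /* assumption: keep the global shift within +/- H original-rate samples */
        if (h_min < -GAMMA * H) h_min = -GAMMA * H;
        if (h_max >  GAMMA * H) h_max =  GAMMA * H;

        int best_h = h_acc;
        double best_e = -1.0;
        for (int h = h_max; h >= h_min; h--) {        /* up to 41 candidates */
            double e = error_energy(h);
            if (best_e < 0.0 || e < best_e) { best_e = e; best_h = h; }
        }
        return best_h;
    }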
  • The fast procedure exploits the fact that the output signal xw(n) from STW' (with n ranging from 0 to Ls-1) is the sum of the output xw1 of STW2a and the output xw2 of STW2b. The procedure adopted according to the invention to determine xw2 takes into account that, for a given shift value, xw2 is given by a summation, equation (8), whose upper limit is the minimum between n and P since, when filtering with null initial conditions, samples with n-k < 0, that is samples of the previous subframe, must not be taken into account.
  • The values of xw2 are actually computed according to (8) for a first group of Γ possible shifts, ranging from hmax to hmax-Γ+1; obviously the tests are stopped if hmin happens to be reached before all Γ shifts have been examined. Thus Γ values of xw2 must actually be computed according to (8) and (9), one for each of the Γ upsampled-signal samples corresponding to an 8-kHz sampling period.
  • Unit EM directly computes an expression of the energy to be minimized as a function of the position of the pulses in the innovation word; for this purpose, the pulse response Q computed during the search for the optimal shift is employed. Working with the pulse response is convenient, compared with actually performing the filtering, because every word includes at most two non-null samples. Moreover, considering the more general case of words with two pulses, the global pulse response is the sum of two responses spaced by a distance equal to the key; the responses for all the other words linked to a key are then obtained simply by a translation of one sample at a time. To simplify the following mathematical expressions, the variability range of the summation index, for summations extended to all samples in a subframe, is not indicated.
  • The particular structure of the innovation codebook allows E(u) and R(e1,u), which depend on the position of the pulse or pulses in the word, to be obtained directly by exploiting the pulse response of filter STW1 (equal to that of filter STW2) determined previously.
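  • As an illustration (C code, not the appendix listing) of how the sparse structure avoids explicit filtering: the correlation R(e1,u) and energy E(u) of a candidate word can be obtained directly from the pulse response q(n), as sketched below; the selection criterion maximizing R²/E (with optimal gain g = R/E) is the usual analysis-by-synthesis criterion and is assumed here, and the function names are illustrative.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Thanks to the sparse codebook, the correlation and energy terms of every
     * candidate word are obtained directly from the pulse response q(n) of the
     * weighted synthesis filter, without filtering each word explicitly.
     * The criterion max R^2/E (optimal gain g = R/E) is assumed here. */
    #define LS 20

    /* Correlation with the target e1 and energy of q placed at position p. */
    void pulse_terms(const double *e1, const double *q, int p,
                     double *R, double *E)
    {
        *R = 0.0; *E = 0.0;
        for (int n = p; n < LS; n++) {
            *R += e1[n] * q[n - p];
            *E += q[n - p] * q[n - p];
        }
    }

    /* Terms for a pulse pair (p1, p1+spacing) with second-pulse sign = +/-1. */
    void pair_terms(const double *e1, const double *q,
                    int p1, int spacing, int sign, double *R, double *E)
    {
        double R1, E1, R2, E2, C = 0.0;
        int p2 = p1 + spacing;
        pulse_terms(e1, q, p1, &R1, &E1);
        pulse_terms(e1, q, p2, &R2, &E2);
        for (int n = p2; n < LS; n++)            /* cross term between the pulses */
            C += q[n - p1] * q[n - p2];
        *R = R1 + sign * R2;
        *E = E1 + E2 + 2.0 * sign * C;
    }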
  • As said, the tests with the words of the first part of the codebook are carried out only if strong energy concentrations in short times are noted, which can indicate the onset of a voiced signal section. To detect them, the energy of a group of samples of the modified residual (e.g. 5 samples) is computed, starting from the beginning of the subframe and shifting the window that selects the group by one sample at a time until the whole subframe has been scanned, and the group showing the maximum energy is stored. The average power (that is, the energy divided by the number of samples) in the window where the maximum occurred and the average power in the whole subframe are also computed.
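  • A minimal sketch (C code, not the appendix listing) of this energy concentration detection; the decision threshold is a placeholder, since the text only states that the two average powers are computed.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Sliding-window detection of strong energy concentrations in a subframe
     * of the modified residual: the 5-sample window with maximum energy is
     * located and its average power compared with that of the subframe.
     * The decision threshold THR is an assumed placeholder value. */
    #define LS  20
    #define WIN 5
    #define THR 4.0

    int energy_concentration(const double *rm)     /* rm[0..LS-1]: modified residual */
    {
        double e_sub = 0.0, e_max = 0.0;

        for (int n = 0; n < LS; n++) e_sub += rm[n] * rm[n];

        for (int n = 0; n + WIN <= LS; n++) {      /* slide the window one sample at a time */
            double e_win = 0.0;
            for (int k = 0; k < WIN; k++) e_win += rm[n + k] * rm[n + k];
            if (e_win > e_max) e_max = e_win;
        }
        /* compare average powers (energy divided by number of samples) */
        return (e_max / WIN) > THR * (e_sub / LS);
    }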
  • The decoder receives from the coder, through connections 2a-2e, indexes j(f), j(d), j(b), j(s), j(gmax), j(gnor) and the innovation sign.
  • LTR2 includes a read-only memory with two tables addressed by indexes j(d), j(b), like LTR1 (Fig. 1), together with a circuit arranged to store the values of d, b related to two consecutive frames and to carry out the comparisons, described in connection with the coder, necessary to decide whether interpolation of d, b is required.
  • Signal ss(n) outgoing from LTS2 is filtered in the short-term synthesis filter STS2, using coefficients ai generated in the coefficient reconstructing circuit STR2 starting from indexes j(f).
  • The reconstructed speech signal y(n) is then subjected to a further filtering in an adaptive filter PF, which uses coefficients obtained from the linear prediction coefficients ai and inserts into the reconstructed speech signal a distortion that improves the perceptual effect. At the output of PF a filtered reconstructed signal yp(n) is available. The use of filters like PF in speech coding is well known to those skilled in the art and does not require further explanation.
  • Note that the decoder does not take the possible shift carried out in the coder into account: in fact, the purpose of the shift is precisely to make the synthesized signal as good a replica of the original signal as possible, and therefore the decoder only requires the information related to the excitation and to the filters.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

In an analysis-by-synthesis coder, the original speech signal undergoes small time shifts to match in time the signal to be coded with the replica produced by the long term synthesis filter. The shift is determined at each subframe by an exhaustive search within a range of possible values so as to minimize the error signal energy. Once the optimal shift has been determined, the optimal excitation is searched for. The excitation is chosen in a codebook containing words with very few pulses arranged in a deterministic structure, which words are all obtained from a limited number of key words. The deterministic codebook structure allows a fast search for the optimal excitation, without need of storing the codebook and actually performing the synthesis filterings of the candidate excitations.

Description

  • The present invention relates to speech coders employing analysis-by-synthesis techniques, and more particularly to a coder for low-bit-rate applications, preferably at the lowest limits of the range of rates for which the above-mentioned coders can be used with good performance, e.g. rates within the 4 - 8 kbit/s range.
  • An example of this type of application is represented by the speech coders to be used for the so-called half-rate channel of the European mobile radio system.
  • In coders using analysis-by-synthesis techniques, for each block of speech signal samples to be coded, the excitation signal for the synthesis filter simulating the speech production apparatus is chosen within a set of excitation signals so as to minimize a perceptually meaningful measure of distortion. This is commonly obtained by comparing the synthesized samples with the corresponding samples of the original signal and simultaneously weighting the result, in a suitable filter, with a function that takes into account how human perception evaluates the resulting distortion.
  • In its most general form, the synthesis filter includes a cascade of two elements that impose short-term and long-term spectral features, respectively, on the excitation signal. The former are linked to the correlation among subsequent samples, which generates a non-flat spectral envelope; the latter are linked to the correlation between adjacent pitch periods, on which the fine spectral structure of the signal depends. With such a scheme, the coded signal includes information relating to the excitation and to the short-term synthesis parameters (short-term linear prediction coefficients or other quantities related to them) and the long-term ones (long-term delay and linear prediction coefficients).
  • The insertion of long-term features into the coded signal greatly enhances the natural sound of the signal, especially if the delay is updated at each subframe during the analysis-by-synthesis cycle; however, the related information would require most of the bits available for coding. Especially in low-bit-rate applications, it is therefore particularly interesting to search for solutions that enable a reduction of the amount of information to be transmitted to the decoder, while preserving signal quality.
  • In the paper "Generalized analysis-by-synthesis coding and its application to pitch prediction" presented by W.B. Kleijn, R.P. Ramachandran and P. Kroon at the ICASSP 92 Conference, San Francisco (California, USA), March 23-26, 1992, paper I-337, it is suggested for this purpose to carry out an interpolation of the long-term analysis delay, the delay being updated at each frame. A direct interpolation, without adequate arrangements, would provide delay values that are not the optimal ones and would provoke time misalignments between the long-term spectral features of the original signal and those of the synthesized signal, which generate a significant distortion.
  • To avoid these inconveniences, the paper suggests modifying the original signal so that the long-term predictor parameters become known functions of time and allow a direct interpolation without degrading performance. The suggested modifications consist of limited time oscillations and small amplitude scalings of the original signal. The time oscillations can be carried out in a discrete manner. The need for inserting these time oscillations, and therefore for setting an optimal amount thereof, obviously increases the coder complexity.
  • To solve this problem, according to the present invention a coding system is provided in which, before long-term analysis, discrete time shifts are introduced on the residual signals and in which the search for the optimal excitation signal and the optimal shift is carried out so as to reduce the complexity of the computations.
       The invention characteristics are disclosed in the appended claims.
  • A preferred embodiment of the invention will now be described, with reference to the enclosed drawings, in which:
    • Fig. 1 is a block diagram of the coder;
    • Fig. 2 is a functional diagram of some blocks of the coder;
    • Fig. 3 is a block diagram of the decoder.
  • Before describing in detail the coder/decoder structure, the principles on which it is based will be summarized. The coder receives samples x(n) of the speech signal to be coded, grouped into blocks (commonly called 'frames') including a fixed number Lf of contiguous samples. Every frame of Lf samples is then divided into subframes of Ls contiguous samples. The coder must determine a set of parameters to be transmitted to the decoder so that the decoder is able to synthesize a signal that approximates the original signal. To achieve this, an analysis-by-synthesis procedure is used, through which the coder analyzes the effects of the possible values of each parameter and chooses the value that enables obtaining the best approximation of the original signal. For this purpose, the coder will contain a replica of the decoder to produce, for each of said values, the corresponding output signal. To generate these output signals, both long-term and short-term correlations of the speech signal are exploited, imposed on an excitation signal through respective synthesis filters. At each frame, the coder carries out a linear prediction analysis (short-term or LPC analysis) and computes the short-term residual signal, that is used to compute parameters (delay and coefficient) of the long-term synthesis filter. (The coefficient is unique in the preferred embodiment, since a first-order filter is used). To improve the resolution of long-term-correlation information, both the delay and the coefficient are interpolated when the delays of the current frame and the previous frame are close in value. To reduce the effects of time mismatches between the original signal and the reconstructed one, at each subframe small time shifts can be introduced in the original speech signal: the shift amount is determined through an exhaustive search in a range of possible values so as to minimize the energy of the error (difference between original signal and reconstructed signal). After having determined the optimal shift, the search for the optimal excitation signal is carried out.
  • In the following, to make the description clearer, the possible excitation signals will be considered as words chosen in a certain codebook, that is, reference is made to a type of coder known as CELP (Codebook Excited Linear Prediction), even though, as will be seen, every word is made up of an extremely small number of pulses (preferably 1 or 2) with deterministically predefined amplitudes and positions, and the codebook is not stored.
  • The coded signal will include information related to short-term and long-term synthesis filter parameters and to the optimal excitation, transmitted as usual in the form of suitably coded indexes.
  • In the decoder, starting from these indexes, an excitation signal corresponding to the one used by the coder will be retrieved and filtered in the chain of a long-term synthesis filter and a short-term synthesis filter to provide a reconstructed signal, which can then be subjected to further filtering (post-filtering), based for example on the short-term synthesis parameters, to improve the subjective signal quality. The reconstructed signal is then converted back into analogue form and supplied to the utilization devices.
  • By way of example, in the following description reference will be made to frames with length Lf = 160 samples (which, with an 8-kHz sampling frequency, correspond to a speech signal segment whose length T = 20 ms), divided into 8 subframes whose length Ls = 20 samples. For reasons related to the introduction of time shifts, it is necessary to have available, in addition to the Lf samples of a frame, a group of H+K samples of the following frame (e.g. H = 24, K = 8).
  • With reference to Fig. 1, the input signal samples x(n) present on a connection 1 are temporarily stored in a buffer MT arranged to store N = Lf+H+K samples; every T ms a block of Lf samples is written and read. The samples read from MT are supplied to a high-pass filter FPA, whose task is removing d.c. drifts and low-frequency noise, and the filtered signal xf(n) is supplied to the short-term analysis circuits STA and to a linear prediction filter LPC.
  • Circuits STA determine, for each frame, a set of P linear prediction coefficients ai (e.g. P = 10), convert these coefficients into a group of parameters in the frequency domain, commonly known as LSP (Line Spectrum Pairs), and carry out a quantization, for example a scalar one, of the differences between adjacent parameters. Indexes j(f), which are part of the coded signal, are transmitted to the decoder through a connection 2a after binary coding in circuits that are not shown. The conversion into line spectrum pairs is desirable since, as is well known, spectrum lines have better quantization, interpolation and synthesis-filter stability-check properties than the coefficients. Before computing the line spectrum pairs, block STA also carries out a smoothing of the spectrum information related to the formants, to match it to the resolution of the quantization circuits. This is accomplished by multiplying the computed coefficients ai by a respective factor g₁i, whose value is typically less than 1 but quite near 1. This operation reduces the risk, in case of particularly narrow formants, of reproducing after quantization formants that are equally narrow but shifted with respect to the original ones, and therefore reduces a possible cause of degradation of the coded signal quality.
  • The circuit STA computes the coefficients ai according to the classical autocorrelation method, as described in "Digital Processing of Speech Signals" by L.R. Rabiner and R.W. Schafer (Prentice-Hall, Englewood Cliffs, N.J., USA, 1978), p. 401. For the computation, STA operates on a set of Lf+P input samples (in particular, the samples that occupy the last Lf+P positions in MT), obtained through a trapezoidal window that weights with a maximum weight (in particular 1) all samples except the first and the last P ones, for which the weights are determined by a simple linear interpolation between the minimum and the maximum weight: in this way the smoothing required by the autocorrelation method to give good results is limited to the overlap area between contiguous windows. The forward positioning of the window also takes into account the fact that, when coding the initial subframes of a frame (e.g. the first 3), in place of the linear prediction coefficients computed for the frame itself, coefficients are used which are obtained by converting line spectrum pair values determined through interpolation between the values related to the previous frame and those related to the current frame. This ensures a gradual transition between current frame parameters and previous frame parameters. As explained, the window spans the current frame and the subsequent one, in the sense that it comprises samples of both frames without, however, having to comprise two full frames.
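  • A minimal illustrative sketch in C (not the appendix listing mentioned elsewhere in this document) of the trapezoidal analysis window and of the coefficient smoothing described above; the exact ramp endpoints and the exponential form g1^i of the smoothing factor are assumptions of this sketch, not details given in the text.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Builds a trapezoidal window of Lf+P samples (linear ramps over the first
     * and last P samples, weight 1 elsewhere) and applies a bandwidth-expansion
     * style smoothing to the coefficients.  The ramp endpoints and the
     * exponential form g1^i of the smoothing factor are assumptions. */
    #define LF 160
    #define P  10

    static double win[LF + P];

    void build_trapezoidal_window(void)
    {
        for (int n = 0; n < LF + P; n++) {
            if (n < P)                    /* rising ramp towards 1            */
                win[n] = (double)(n + 1) / (P + 1);
            else if (n >= LF)             /* falling ramp over the last P     */
                win[n] = (double)(LF + P - n) / (P + 1);
            else
                win[n] = 1.0;             /* flat central part                */
        }
    }

    void smooth_coefficients(double *a, double g1)   /* e.g. g1 = 0.98 (assumed) */
    {
        double f = g1;
        for (int i = 0; i < P; i++) {     /* a_i -> a_i * g1^i, i = 1..P      */
            a[i] *= f;
            f *= g1;
        }
    }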
  • The transformation of the linear prediction coefficients into line spectrum pairs is carried out, for example, in the way described by P. Kabal and R.P. Ramachandran in the article "The computation of line spectral frequencies using Chebyshev polynomials", IEEE Transactions on Acoustics, Speech and Signal Processing, December 1986.
  • The operations of STA are typical of any linear prediction coder, and therefore a more detailed description is not necessary.
  • The indexes j(f) are also supplied to a linear prediction coefficient reconstructing circuit STR1, which supplies the filter LPC, the short-term synthesis filters STS1, STS' and the spectral weighting filters SW, SW' with quantized values of the coefficients, obtained by applying the inverse of the procedures used to transform the coefficients into line spectrum pairs. STR1 also computes the interpolated values to be used in the first three subframes. To simplify, in the following the quantized values are also designated ai.
  • The filter LPC receives the filtered speech signal samples xf(n) and filters them according to the conventional function

    rs(n) = xf(n) - Σ ai·xf(n-i),   the sum being extended to i = 1, ..., P

    generating the short-term prediction residual rs(n), which is supplied both to a low-pass filter FPB, producing a filtered residual signal rf(n), and to the time shift circuits TS, producing a modified residual signal rm(n). Low-pass filtering, as is well known, facilitates the operations of the following long-term analysis circuits LTA.
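  • As a minimal illustrative sketch in C (not the appendix listing mentioned elsewhere in this document), the residual computation above can be written as follows; the buffer handling and the function name are assumptions of the sketch.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Computes rs(n) = xf(n) - sum_{i=1..P} a[i]*xf(n-i); xf must provide
     * at least P samples of history before xf[0] (previous frame samples). */
    #define P 10

    void lpc_residual(const double *xf, const double *a, double *rs, int Lf)
    {
        for (int n = 0; n < Lf; n++) {
            double pred = 0.0;
            for (int i = 1; i <= P; i++)
                pred += a[i - 1] * xf[n - i];   /* relies on the P-sample history */
            rs[n] = xf[n] - pred;
        }
    }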
  • The circuits LTA must determine at each frame, and supply to the following long-term synthesis filter LTS1, the delay d (pitch period) with which a sample of the excitation signal is used to generate the reconstructed signal, and the gain or coefficient b with which said sample is weighted.
  • The block LTA computes the delay d by maximizing the autocorrelation function

    R[rf(k)] = Σ rf(n)·rf(n-k)   (1)

    where k can vary between a minimum and a maximum value allowed for the delay d (e.g., 20 and 120) and the summation is extended over a window whose length depends on a preset number x, chosen so that the window taken into account for the calculation allows a satisfactory value of d to be obtained. Considering that the window must include the most recent samples, as already said, its length is a compromise between two opposing needs: the greater the length, the more accurate the evaluation; on the other hand, the shorter the window, the closer its center is to the end of the frame to be coded (Lf samples), and therefore the closer the value obtained is to that end, which is what interpolation requires. For example, x can be equal to K. In the preferred embodiment the delay is never less than the length of a subframe, which considerably simplifies the subsequent operations. The value computed with (1) can also be subjected to corrections, examined later, aimed at guaranteeing as smooth a behaviour of d as possible and at compensating for the synchronism losses due to the time shift.
  • The value of coefficient b is determined so as to minimize the energy of the error signal rl(n), given by the equation

    rl(n) = rf(n) - b·rf(n-d)   (2)

    For the value d of the delay to be used for the current frame, b is given by the equation b = R[rf(d)]/E(rf), where E(rf) indicates the energy

    E(rf) = Σ rf²(n-d)

    the summation being extended over the analysis window. A minimum and a maximum, 0 and 1 respectively, are also set for the value of b. Values less than 0 are excluded because they would correspond to a signal inversion, which would also require transmitting a sign bit, while values greater than 1 make the filter unstable, as is well known. The value of b computed using (2) can also be subjected to corrections aimed at guaranteeing the best quality of the coded signal. Furthermore, in certain frames, instead of the values of d and b computed with (1) and (2), it is possible to use values obtained by linear interpolation between the values computed for the previous frame and those computed for the current frame.
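  • A minimal sketch in C (not the appendix listing mentioned elsewhere in this document), assuming the correlation R and the energy E written above and a simplified window handling; function and variable names are illustrative, not taken from the patent.

    /* Illustrative sketch only (not the patent's appendix listing).
     * Long-term analysis of equations (1)-(2): d maximizes
     * R[rf(k)] = sum_n rf(n)*rf(n-k); b = R[rf(d)]/E(rf) with
     * E(rf) = sum_n rf(n-d)^2, clipped to [0,1].
     * rf must provide D_MAX samples of history before rf[0]. */
    #define D_MIN 20
    #define D_MAX 120

    typedef struct { int d; double b; } lt_params;

    lt_params long_term_analysis(const double *rf, int len)
    {
        lt_params p = { D_MIN, 0.0 };
        double best = -1e300;

        for (int k = D_MIN; k <= D_MAX; k++) {
            double R = 0.0;
            for (int n = 0; n < len; n++)
                R += rf[n] * rf[n - k];
            if (R > best) { best = R; p.d = k; }
        }
        double E = 0.0;
        for (int n = 0; n < len; n++)
            E += rf[n - p.d] * rf[n - p.d];

        p.b = (E > 0.0) ? best / E : 0.0;
        if (p.b < 0.0) p.b = 0.0;          /* values below 0 are excluded        */
        if (p.b > 1.0) p.b = 1.0;          /* values above 1 make LTS1 unstable  */
        return p;
    }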
  • Together with the computation of d and b, the prediction gain G is also computed: this quantity represents the ratio between the energies of the input and output signals of the long-term predictor and gives a measure of the long-term prediction efficiency. Gain G is defined by the expression

    G = 1 / (1 - b·R[rf(d)]/E'(rf))

    where E'(rf) = Σ rf²(n) is the energy of the filtered residual over the analysis window. Gain G allows establishing whether the speech segment being coded is voiced, which is indicated by values of G and b that are both greater than respective thresholds Gthr, bthr. In the case of a voiced sound, LTA generates a flag V that is used to decide whether to carry out the interpolation and to introduce the time shift.
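  • As a purely numerical illustration (not taken from the patent, and assuming the expression for G given above): since b·R[rf(d)]/E'(rf) equals the squared normalized correlation between rf(n) and rf(n-d), a segment with a normalized correlation of about 0.87 gives b·R/E' ≈ 0.75 and hence G = 1/(1-0.75) = 4, i.e. roughly 6 dB of long-term prediction gain; with thresholds Gthr, bthr below these values, such a segment would be classified as voiced.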
  • A first correction for delay d is based on the search for a local maximum of function (1) also in a given neighborhood (e.g., ±15%) of the value obtained at the previous frame: if this local maximum differs from the main maximum by less than a certain limit, the new value is used, since it provides a smoother outline of d that can therefore be interpolated. This secondary search is carried out only if the signal in the previous frame was strongly voiced and had been subjected to interpolation. Moreover, the correction, if any, is carried out before computing b and G, so that the already corrected value of d is used for these computations.
  • A second correction is linked to the presence of the time-shift mechanism, which inserts a variable delay whose effects can be compared to those of a non-synchronous operation of the coder. To try to recover synchronous operation, the value of d computed by LTA, and possibly corrected as said before, is changed by adding to it a corrective term d' linked to the amount of the shift itself and given by the expression

    d' = ĥ·d / (Γ·Lf)

    where ĥ is the shift accumulated up to that frame, expressed as a number of samples of the residual signal upsampled by a factor Γ, while d and Lf have the meanings given before. Upsampling will be discussed in greater detail with reference to circuits TS; it means that the samples obtained by sampling the original speech signal at a first sampling rate are in turn submitted to a sampling at a higher rate. Thus, if samples obtained by an 8-kHz sampling are themselves sampled at 64 kHz, each 8-kHz sample originates eight samples at 64 kHz. The correction can be carried out if interpolation is required in the current frame and if the speech segment is not voiced. The first condition is necessary since, if interpolation is absent, no shift is carried out; moreover, the signal must not be voiced because in that situation even a minimal modification of d with respect to the exact value can usually be perceived. Before the corrective term is added to d, its absolute value is limited to a maximum value |d'|max, for example 1. Furthermore, the correction is carried out only if it does not modify the decision about interpolation (described later) and does not take the value of d outside the allowed range of values.
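  • Purely as an illustration of the order of magnitude involved (the numbers are not taken from the patent): with Γ = 8, Lf = 160, a delay d = 60 and an accumulated shift ĥ = 64 upsampled samples (8 samples at the original 8-kHz rate), the expression gives d' = 64·60/(8·160) = 3; since |d'|max = 1, the correction actually added to d would be limited to 1.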
  • As regards b, a first correction consists of clipping b to a first upper limit b₁ since, if b is too high, an excessive energy increase would occur, giving rise to audible noise. Limit b₁ is linked to the ratio between the energies in a pitch period of the current frame and of the previous one, and is given by the expression

     b₁ = [E''(rf)₀ / E''(rf)₋₁]^(d/2Lf)

     where E''(rf) denotes the quantity E''(rf) = Σ rf²(n), the summation being extended over a pitch period of d samples (that is, the energy in a pitch period d), and indexes 0, -1 denote the current and the previous frame, respectively. The correction is carried out if the energy in the previous frame exceeds a certain threshold.
  • A further limitation of b is carried out in the case of low values of G (less than Gthr), which indicate speech segments with low periodicity, while b is relatively high (greater than a second limit b₂): in this case the value b₂ is employed, since employing the actual value could produce artifacts in the coded signal.
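  • A minimal C sketch of the two corrections on b described above is given below; the function name, the parameter list and the way the energies are passed in are illustrative assumptions, the rules themselves being those of the text:

     #include <math.h>

     /* Clip b according to the two corrections: b1 = (E_cur/E_prev)^(d/2Lf)
      * when the previous-frame energy exceeds a threshold, and b2 when the
      * prediction gain G is below Gthr while b is above b2; the range
      * limits 0 <= b <= 1 set earlier are enforced as well.                */
     static double correct_b(double b, double G, double Gthr, double b2,
                             double E_cur, double E_prev, double E_prev_thr,
                             double d, double Lf)
     {
         if (E_prev > E_prev_thr) {
             double b1 = pow(E_cur / E_prev, d / (2.0 * Lf));
             if (b > b1) b = b1;
         }
         if (G < Gthr && b > b2) b = b2;
         if (b < 0.0) b = 0.0;
         if (b > 1.0) b = 1.0;
         return b;
     }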
  • As regards interpolation, this is carried out if the relative variation of d between two consecutive frames does not exceed, in absolute value, a predetermined amount (e.g., 15%) and if the values of b in these frames are both positive. The actual computation of the values of d and b to be used in case of interpolation is carried out in the long-term synthesis filter LTS1, to which LTA sends a flag F when the above-mentioned conditions are met. The same flag is also supplied to an error energy minimizing circuit EM that determines the optimal time shift and excitation. Information about interpolation is also required by the synthesis filter in the decoder; however, it is not necessary to transmit it, since it can be immediately recreated in that filter by comparing the values of d and b related to two frames, exactly as in the coder.
  • The values of d and b determined at each frame are converted as usual into respective indexes j(d), j(b), which constitute the information related to long-term analysis to be inserted into the coded signal and are transmitted to the decoder, after suitable coding, through connections 2b, 2c. Index j(b) is determined through a quantization operation, during which, in addition to limiting the maximum value to 1, values of b that are less than half of the first quantized value are forced to 0. No quantization of d is necessary, since d is already a discrete quantity; it is however preferable to transmit d in the form of an index for the sake of uniformity with the other information. The conversion of the values of d into indexes practically consists of a shift, such as to make the possible range of values begin from 1 instead of from a value dmin. In the described example (101 values of d and j(d)), 7 bits will be necessary to code index j(d), and these bits also allow coding values of j(d) outside the provided range. One of these further values (e.g., value 127) is used to signal the forcing of b to 0 and is supplied to the decoder in place of the index j(d) corresponding to the actual value of d since, if b = 0, the long-term synthesis filter does not contribute to the reconstructed signal and the delay information is useless. In addition to the information about the forcing of b to 0, however, the index j(b) corresponding to the minimum value of b is transmitted.
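  • Purely as an illustration of the rules just described, the index generation for d and b can be sketched in C as follows; dmin, the quantizer stand-ins and the data structure are assumptions made for the example:

     #define DMIN        20   /* assumed minimum delay dmin                  */
     #define JD_FORCE_B0 127  /* out-of-range value of j(d) signalling b = 0 */

     /* Stand-ins for the actual b quantizer: index of the quantized value
      * (1 = minimum level) and the level corresponding to a given index.   */
     extern int    quantize_b(double b);
     extern double quantized_b(int jb);

     typedef struct { int jd; int jb; } lt_indexes;

     static lt_indexes encode_lt_params(int d, double b)
     {
         lt_indexes ix;
         if (b > 1.0) b = 1.0;                 /* maximum value limited to 1 */
         ix.jb = quantize_b(b);
         if (b < 0.5 * quantized_b(1)) {       /* force b to 0               */
             ix.jb = 1;                        /* index of minimum b is sent */
             ix.jd = JD_FORCE_B0;              /* delay information unused   */
         } else {
             ix.jd = d - DMIN + 1;             /* allowed range starts at 1  */
         }
         return ix;
     }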
  • To simplify, circuits generating indexes j(b), j(d) are included into block LTA.
  • It must be noted that the correction of d that takes possible shifts into account is carried out after the corrections of b, since only on the basis of the corrected values of b can circuit LTA take the decisions related to the voiced or unvoiced nature of the sound and to the need to carry out interpolation, and therefore the shift.
  • The operations performed by LTA are described in detail in the appendix, which includes a program listing in the C language. Given the listing, a person skilled in the art will have no difficulty in designing devices that perform the described functions.
  • Indexes j(d), j(b) are reconverted into quantized or reconstructed values of the respective parameters by reconstructing circuits LTR1, composed of simple read-only memories addressed by the indexes. During this reconstruction, LTR1 provides the actual values of d, b if j(d) indicates a value allowed for the delay (that is, if j(d) is in the range 1 to 101). If j(d) has any value outside the allowed range (i.e., from 102 to 127), LTR1 provides value 0 for b and value dmin for d. The fact that, when reconstructing the parameters, all indexes j(d) not corresponding to an allowed delay value, and not only the one actually used for this purpose, are interpreted as an indication of the forcing of b to 0, allows the value b = 0 to be reconstructed even in case of errors on the least significant bits of that index. In any case, should the reconstruction of b = 0 fail, circuits LTR1 generate the minimum value of b, since they have the corresponding index j(b) at their disposal. To simplify, in the following, reconstructed (or quantized) values will also be denoted by b, d.
  • The long-term synthesis filter LTS1 generates a reconstructed short-term residual signal ss(n) by filtering an excitation signal s₁(n) according to the conventional function

     1/P(z) = 1/(1 - b·z⁻ᵈ)

     The excitation is composed of a shape information (innovation), represented by one of the words s(n) of an innovation codebook IC1, of a positive or null amplitude parameter g (innovation gain), chosen in a codebook of innovation gains IG1, and of a sign information, represented by a parameter σ (innovation sign) whose value is ±1. Signal s₁(n) is therefore given by

     s₁(n) = σ·g·s(n) = g₁·s(n)
    and is obtained through a multiplier M1. To simplify, we suppose that also parameter σ is read in codebook IG1. Even if, to facilitate understanding, codebooks IC1, IG1 are represented as circuit blocks (that could suggest the idea of memories that contain them), as said above, the particular structure of innovation codebook makes their storage superfluous. The structure of innovation and gain codebooks will be examined later.
  • In order to obtain a sample of the reconstructed residual ss(n), LTS1 must weight with the factor b the sample related to instant n-d. In case no interpolation has to be performed, operation of LTS1 is quite conventional. In case of interpolation, the values of d and b are computed sample by sample according to the equations

     d(n) = d(-1) + (n+1)·Δd

     b(n) = b(-1) + (n+1)·Δb   (5)

     with n = 0 ... Lf-1, Δd = [d₀ - d(-1)]/Lf and Δb = [b₀ - b(-1)]/Lf. Symbols d₀, b₀ denote the values related to the current frame, d(-1), b(-1) those related to the previous frame. The interpolation is therefore linear and extends over a whole frame. The values of d(n) and b(n) thus vary sample by sample. As regards d(n), it will generally not be an integer: this means that the value of signal ss(n) at the continuous time instant n-d(n) does not coincide with that of an actually available sample and must be evaluated; according to the invention, the evaluation is performed through a second-order polynomial interpolation (that is, through a parabola) centered about the discrete time instant that is nearest to n-d(n); the value thus evaluated is then multiplied by the interpolated value b(n).
  • The interpolation procedure adopted has a much lower computational complexity than more sophisticated interpolation methods based on signal filtering. However, its effect is essentially a low-pass one, which is actually useful for the good operation of the coder since it prevents the reconstructed signal from having too marked periodicity properties.
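  • The following C fragment sketches, for illustration only, the sample-by-sample interpolation of d and b and the parabolic evaluation at non-integer delays; the buffer convention for ss (negative indexes reaching into previous frames) and the function name are assumptions:

     #include <math.h>

     /* Long-term contribution b(n)*ss(n - d(n)) when interpolation is on.
      * ss[] is assumed to give access to past reconstructed residual
      * samples, including those of previous frames (negative arguments).   */
     static double lt_contribution(const double *ss, int n, int Lf,
                                   double d_prev, double b_prev,
                                   double d_cur,  double b_cur)
     {
         double dn = d_prev + (n + 1) * (d_cur - d_prev) / Lf;  /* linear  */
         double bn = b_prev + (n + 1) * (b_cur - b_prev) / Lf;  /* interp. */

         double t  = n - dn;                  /* generally non-integer     */
         int    m  = (int)floor(t + 0.5);     /* nearest discrete instant  */
         double f  = t - m;                   /* fractional part           */
         double y0 = ss[m - 1], y1 = ss[m], y2 = ss[m + 1];
         /* second-order (parabolic) interpolation centred on m            */
         double v  = y1 + 0.5 * f * (y2 - y0)
                        + 0.5 * f * f * (y2 - 2.0 * y1 + y0);
         return bn * v;
     }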
  • The reconstructed short-term residual ss(n) is supplied to the short-term synthesis filter STS1, whose transfer function is 1/[1 - A(z)]. This filter generates the reconstructed speech signal y(n), which is supplied to the spectral weighting filter SW whose transfer function is, as usual, [1 - A(z)]/[1 - Aw(z)], where Aw(z) is the function

     Aw(z) = Σ awi·z⁻ⁱ,   i = 1 ... P

     with awi = ai·γⁱ, where γ is an experimentally determined corrective factor that determines the band widening around the formants. The reconstructed and weighted signal yw(n) is subtracted in an adder SM from the modified reconstructed and weighted signal xw(n), obtained by filtering the output signal of TS in the cascade of two filters STS', SW', respectively identical to STS1 and SW. At the output of SM a weighted error signal e(n) is obtained, which is supplied to the error energy minimizing circuit EM that performs all operations necessary to determine the optimal shift and excitation.
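  • A possible computation of the weighting-filter coefficients awi = ai·γⁱ is sketched below for illustration; the array layout (a[i-1] holding ai) and the order P_ORD are assumptions:

     #define P_ORD 10   /* assumed predictor order P */

     /* Bandwidth-expanded coefficients of [1 - A(z)]/[1 - Aw(z)]:
      * aw[i-1] = a[i-1] * gamma^i, i = 1 ... P_ORD.                        */
     static void weighting_coeffs(const double a[P_ORD], double aw[P_ORD],
                                  double gamma)
     {
         double g = gamma;
         for (int i = 0; i < P_ORD; i++) {
             aw[i] = a[i] * g;
             g *= gamma;
         }
     }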
  • The purpose of circuits TS is to align in time the signal to be coded with the replica that the long-term synthesis filter is able to produce and, in particular, to avoid misalignments between pitch peaks in the signal predicted by LTS1 and in the original one. For this purpose, at each subframe TS shifts by a certain amount Δh the time window of Ls samples that locates the subframe itself. The shift to be applied is determined by unit EM with a fast search procedure within a range of values defined by a maximum allowable shift. The shift is applied to the residual signal and not to the original one because the resulting distortion is smoothed by the subsequent filtering in STS', SW' and is therefore substantially imperceptible. In order to avoid too sudden variations, the shift applied in a subframe is algebraically added to the one accumulated up to that time, providing a global shift ĥ; the global shift, too, cannot exceed a certain maximum value (H samples of the original signal). The reason why H samples of the following frame have also been loaded in MT is therefore evident. The purpose of limiting the shift variation is to avoid excessive distortions; the limitation on the global shift is instead determined by the delay that can be tolerated in the coding procedure and therefore by the availability of future samples. The time shift has a resolution smaller than one sampling period of the original signal, and it is therefore necessary to carry out an upsampling of the residual signal.
  • Taking all this into account, circuit TS includes an upsampling circuit US (in practice an interpolating filter), which supplies at its output the upsampled residual r̂s(n̂), and a shifting element SH that receives from EM the information about the shift entity ĥ and generates the modified upsampled residual r̂m(n̂). In the example, the upsampling ratio Γ is 8, and therefore the upsampled signal has a frequency of 64 kHz: this upsampling ratio provides a suitable resolution for all desired purposes. Moreover, for the correct operation of the interpolating filter, a certain number of samples following the ones of interest must always be available: this is the reason why the further K samples of the following frame are also loaded in MT.
  • It is not necessary to materially carry out the downsampling needed to obtain a modified residual signal with an 8-kHz sampling frequency, since this operation can be carried out implicitly, when necessary, by simply reading one sample of r̂m(n̂) every Γ, with a suitable phase. Downsampling is the operation inverse to upsampling, recovering the samples at the lower rate.
  • Element SH will practically be a memory that loads, at each subframe, the ΓLs samples of the upsampled residual plus a certain number of following and previous samples linked to the maximum allowed shift in a frame (in practice, a number of samples equal to twice the maximum shift, as will be explained in the description of optimal shift search); SH is addressed for reading by the error energy minimizing unit EM, in such a way as to supply the following circuits with Ls samples adequately shifted with respect to the incoming subframe.
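  • The implicit downsampling mentioned above amounts, in practice, to the read operation sketched below (illustrative C; array names and addressing conventions are assumptions):

     #define GAMMA 8   /* upsampling ratio (64 kHz / 8 kHz) of the example */

     /* rm_up[] holds the upsampled residual stored in SH; the subframe of
      * Ls samples at 8 kHz for shift h is obtained by reading one sample
      * every GAMMA, the phase being implied by h itself.                   */
     static void read_modified_residual(const double *rm_up, int h,
                                        double *rm, int Ls)
     {
         for (int n = 0; n < Ls; n++)
             rm[n] = rm_up[h + n * GAMMA];
     }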
  • Turning back to the innovation codebook, this includes a certain number of words, each having Ls samples, of which only a very limited number differ from 0. This choice derives from the fact that, the codebook being quite limited, it would be illusory to expect to find in it words with many pulses (that is, non-null samples) in which all pulses are actually suitable; moreover, it reduces the amount of computation necessary when searching for the optimal excitation. In the preferred embodiment of the present invention, the codebook is composed of two parts. The first one includes Ls words having a single non-null sample, with amplitude equal to 1 and positive sign, and Ls-1 null samples. The non-null sample occupies a different position in each word, so that the words can be obtained one from the other by simply shifting the non-null sample by one position. For this first part of the codebook, signal s(n) can be represented as

     s(n) = δ(n-n₁)   (5)

     where δ is the well-known unit pulse function and n, n₁ can have values between 0 and Ls-1.
  • The second part includes words with two samples whose amplitude is 1, and Ls-2 null samples. These words are generated starting from a limited number of key-words (in particular 3) with the method described in European Patent Application EP-A-0396121 in the name of CSELT. In the example taken into account, the three key-words all have the first pulse in position 0 and the second pulse in a respective key position n₂(1), n₂(2), n₂(3); the other words are obtained by shifting the pulse pair towards the word end till the second pulse reaches that end or the first pulse reaches the respective key position. The key positions are chosen so as to give origin to Ni2 (in particular 21) possible positions of the pulse pair; for each of these positions there are two words that differ from each other in the sign of the second pulse, as described in said European Application, which brings the total number of words in the innovation codebook to Ls+2·Ni2 (62 in the example). For this second part of the codebook, an innovation word is represented by the equation

     s(n) = δ(n-n₁) ± δ(n-n₂)   (6)

     with n = 0 ... Ls-1, n₁ = 0 ... Ls-1-n₂(p), n₂ = n₂(p) ... Ls-1, p = 1 ... Nip, where n₂(p) denotes the generic key position and Nip is the number of key positions used (3 in the example).
  • The innovation codebook structure, with few non-null samples and with words obtained by shifting the samples by one position starting from a limited number of keys, is a simple deterministic structure that enables a fast search procedure for the optimal excitation, requiring neither codebook storage nor the actual filtering of the candidate excitation signals.
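  • The deterministic structure just described can be generated, for illustration, by the following C sketch; the subframe length, the key values and the representation of each word by its pulse positions (n2 = -1 for single-pulse words, the second-pulse sign being handled separately) are assumptions of the example:

     #define LS   20   /* assumed subframe length Ls    */
     #define NKEY 3    /* number of key-words, as above */

     /* Fill n1[]/n2[] with the pulse positions of all innovation words:
      * LS single-pulse words, then, for each key p, the pairs obtained by
      * shifting (0, key[p]) towards the word end until the second pulse
      * reaches the end or the first pulse reaches the key position.        */
     static int build_positions(const int key[NKEY], int n1[], int n2[])
     {
         int w = 0;
         for (int i = 0; i < LS; i++) { n1[w] = i; n2[w] = -1; w++; }
         for (int p = 0; p < NKEY; p++)
             for (int s = 0; s <= key[p] && key[p] + s <= LS - 1; s++) {
                 n1[w] = s; n2[w] = key[p] + s; w++;
             }
         return w;  /* LS + Ni2 position sets (each pair used with both signs) */
     }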
  • During the search for the optimal innovation, the test with the words of the first part of the codebook must be carried out only if the long-term analysis has indicated a voiced sound or, otherwise, when strong energy concentrations are noted in short signal sections. Such strong concentrations can in fact signal the onset of a voiced section, which cannot yet be classified as such, since classification is based on long-term analysis and the previous signal sections contained no useful features indicating that onset. Under these conditions, filter LTS1 would not be able to supply a correct predicted signal. Now, for a good coded-signal quality it is mandatory that pitch pulses be correctly reproduced, and the use of single-pulse words therefore proves useful precisely to compensate for an inadequate operation (in voiced sections) or an impossible correct operation (in onsets) of the long-term synthesis filter. Single-pulse words, on the contrary, must not be used to reproduce unvoiced sounds that are not onsets, where their use is counterproductive even if one of them actually provides the minimum error-signal energy, since the subjective effect is usually worse.
  • The manner in which strong energy concentrations in short times are detected will be described afterwards.
  • The words in the codebook are identified by a respective index j(s); the index related to the optimal word, suitably coded, is transmitted to the decoder through a connection 2d. Since in the described example the codebook includes 62 words, to which as many indexes j(s) correspond, two further values of j(s) that do not correspond to any word in the codebook are available without having to modify the number of bits coding j(s). These are used to represent a null innovation gain, as will be explained afterwards; similarly to what has been done for the long-term prediction delay and coefficient, when generating the indexes only one of the two values of j(s) not corresponding to an innovation word will be used to indicate g = 0 and, when decoding, g will be set to 0 in correspondence with both values of j(s).
  • As regards gain g, this is quantized using a codebook built so as to allow saving coding bits with respect to what would actually be necessary to represent all possible values provided in the codebook. Information about gain, for each subframe, is represented in the form of two indexes j(gmax), j(gnor), the first one of which is linked to the maximum value of g in the frame, and the second one to the difference between such maximum value and the actual value, and by sign σ. This information is transmitted to the decoder through a connection 2e.
  • The codebook includes a number Nig of possible absolute values of g that can be represented as Nig = Nim + Nin - 1, where Nim and Nin are two different powers of 2. For example, we can have Nim = 2⁴ and Nin = 2², or Nim = 2⁴ and Nin = 2³. At each subframe, the optimal value of g determined with the error minimizing procedure that will be described afterwards is quantized, generating a respective index j(g) that is not transmitted but is reconstructed in the decoder. At the end of the frame, the value j(gmax) related to the maximum frame gain is identified and is transmitted as such if it is not less than Nin; otherwise, index j(gmax) is forced to the value Nin. In this way, j(gmax) can only assume Nim values and therefore the number of coding bits is limited. Once j(gmax) has been identified, index j(gnor) is computed for every subframe with the equation j(gnor) = j(gmax) - j(g); j(gnor) can have values in the range between 0 and Nim+Nin-2. The actual value of index j(gnor) is transmitted only if it is not greater than Nin-1; otherwise, the gain is deemed 0 (that is, innovation is silenced for subframes where the gain is very small with respect to the maximum one) and index j(s) of the innovation word is forced to one of the values that do not correspond to any codebook word, to signal the transmission of a word with null gain. In this way, a reduced differential dynamics is used and the bits that would have been needed to represent the gain over the whole dynamics are saved, at the expense of a slight performance loss due to possible innovation silencing. To minimize the effect of channel errors on innovation index j(s), in case of silencing the value Nin-1 for index j(gnor) is transmitted anyway.
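  • As a non-limiting illustration, the differential coding of the gain indexes described above can be sketched in C as follows, with the example values of Nim and Nin:

     #define NIM 16   /* Nim = 2^4 */
     #define NIN 4    /* Nin = 2^2 */

     /* jg[] are the per-subframe quantizer indexes of a frame; the function
      * produces the transmitted j(gmax) and, per subframe, j(gnor), marking
      * the silenced subframes (whose innovation index must then be forced
      * to a value not corresponding to any codebook word).                  */
     static void code_gains(const int *jg, int nsub,
                            int *jgmax, int *jgnor, int *silenced)
     {
         int m = jg[0];
         for (int k = 1; k < nsub; k++) if (jg[k] > m) m = jg[k];
         *jgmax = (m >= NIN) ? m : NIN;       /* j(gmax) limited to Nim values */

         for (int k = 0; k < nsub; k++) {
             int jn = *jgmax - jg[k];
             silenced[k] = (jn > NIN - 1);    /* gain deemed 0                 */
             jgnor[k]    = silenced[k] ? NIN - 1 : jn;
         }
     }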
  • The gain codebook can be a logarithmic codebook, so that the ratio between two consecutive values is a constant. The ratio must take into account several requirements:
    • consecutive values must be as close as possible, in dB, to allow as accurate a quantization as possible;
    • global dynamics between minimum gain g(1) and maximum one g(Nim+Nin-1) must be adequately extended to cover the different types of sound and a reasonable set of different voice levels;
    • differential dynamics between g(x-Nim+1) and g(x) must be adequately extended to make the probability of silencing reasonably low.
  • For example, with the above values of Nim, Nin, the value of the ratio between two consecutive gain levels can range from 3 to 6 dB.
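  • A logarithmic codebook with these properties can be built, for example, as in the short sketch below; g_min and step_db are assumed tuning parameters:

     #include <math.h>

     /* Nig gain levels with a constant ratio (step_db, e.g. 3 to 6 dB)
      * between consecutive values, starting from g_min.                    */
     static void build_gain_codebook(double *g, int Nig,
                                     double g_min, double step_db)
     {
         double ratio = pow(10.0, step_db / 20.0);
         g[0] = g_min;
         for (int i = 1; i < Nig; i++)
             g[i] = g[i - 1] * ratio;
     }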
  • The fast search procedure for the optimal shift and excitation will now be described, referring also to the operative diagram in Fig. 2, which corresponds to the set of blocks M1, LTS, STS, STS', SM, SW, SW' of Fig. 1. In Fig. 2 the same symbols as in Fig. 1 are used, with the exception of blocks STW1, STW2, which represent the filters resulting from the cascade of filters STS1, SW and STS', SW' respectively, that is, filters with transfer function 1/[1 - Aw(z)]. In this Figure, each of the filters has been divided into an element with null input (LTSa, STW1a, STW2a), which provides the contribution of the initial conditions (that is, of the filtering memories of previous subframes), and into an element (STW1b, STW2b) that is reset at each subframe (filtering with null initial conditions), as indicated by signal R supplied by a time base, not shown. The filtering of the excitation with null initial conditions is only the short-term one, since it has been assumed that delay d is not less than a subframe.
  • The optimal shift determination is composed of three steps:
    • evaluation of the need to perform a shift;
    • determination of a suitable range of shift values;
    • search for the optimal shift in the range.
  • In the first step, it is checked if three conditions are satisfied:
    • the subframe is not silence, which is shown by the fact that the energy of rs(n) is greater than a given threshold;
    • the signal is voiced or has been subjected to interpolation, which is shown by flags F, V coming from LTA;
    • a peak of rs(n) actually occurs in the subframe, which is shown by the fact that the average power of rs(n) in the subframe (that is, the energy divided by the number Ls of samples) is greater than or equal to the average power in a period of length d that ends with the last sample of the subframe itself.
  • The reason for the first condition is obvious. As regards the second and the third one, the shift must be performed only if there is a pitch peak in the subframe. This occurs first of all in voiced sections; moreover, the fact that an interpolation has occurred, that is, that the parameter values obtained in two subsequent frames are very close, suggests a certain periodicity in the signal segment to be coded, and therefore enabling the shift in this case too can be useful to further reduce the risk of misalignment between the reconstructed signal and the original signal.
  • The computation of energies and powers can be carried out equally well on the upsampled signal or on the original one. During these computations, the maximum absolute value of r̂s in the current subframe and its position are also obtained: they will be used in determining the shift. To determine the position of the maximum it is mandatory to operate on the upsampled signal, so as to get the maximum resolution.
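  • The three conditions of the first step can be summarized, for illustration, by the following C sketch; the thresholds, the buffer convention for rs (negative indexes reaching into previous subframes) and the use of a power comparison for the third condition are assumptions consistent with the description:

     /* Returns 1 if the time shift is enabled for the current subframe.    */
     static int shift_enabled(const double *rs, int Ls, int d,
                              int flag_V, int flag_F, double silence_thr)
     {
         double e_sub = 0.0, e_pitch = 0.0;
         for (int n = 0; n < Ls; n++)       e_sub   += rs[n] * rs[n];
         for (int n = Ls - d; n < Ls; n++)  e_pitch += rs[n] * rs[n];

         if (e_sub <= silence_thr)          return 0;  /* silence           */
         if (!flag_V && !flag_F)            return 0;  /* no voicing/interp.*/
         if (e_sub / Ls < e_pitch / d)      return 0;  /* no peak in subfr. */
         return 1;
     }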
  • The second step determines the lower and upper extremes ĥmin, ĥmax of a range that extends around the shift value ĥ accumulated so far in the frame. Values ĥmax, ĥmin are initially fixed so that the differences ĥmax - ĥ and ĥ - ĥmin have a prearranged value Γ·Δh, for example 20 samples of the upsampled signal r̂s. There is therefore a maximum number of possible values (41 in the example) among which the optimal shift can be searched. The actual extreme values ĥmin, ĥmax may be non-symmetrical with respect to value ĥ (that is, the range can be limited on one or both sides of the accumulated value ĥ), since it is necessary to avoid shifting the subframe too much, both into the past, with possible duplication of a maximum of r̂s previously taken into account, and into the future, with consequent loss of a maximum. This check is made possible by storing the position of the maximum of r̂s in the subframe. However, unless the range has been limited on both sides, the search for the optimal shift is carried out trying to keep the range width constant, by also taking into account some values beyond the extreme that is not subjected to limitation. In any case, the shift to be carried out must not make the value H be exceeded.
  • The optimal shift value within the test range is the one minimizing energy of an error signal e₁(n) represented by the difference between reconstructed and weighted modified signal xw(n) (Fig. 1) and contribution yw1(n) of excitation filtering memories, and is obtained with a fast search procedure that allows reducing the amount of necessary computations.
  • For this fast search it must be taken into account, on one hand, that the output signal xw(n) of the chain STS', SW' (block STW2 in Fig. 2) can be expressed, for n ranging from 0 to Ls-1, by a relation (7) that separates the contribution of the filtering memory from that of the current subframe and, on the other hand, that the same signal is the sum of the output xw1 of STW2a and the output xw2 of STW2b. The summation in (7) represents signal xw1, which can be computed once and for all, like the corresponding contribution yw1 of the chain LTSa, STW1a; therefore an error e₀ = xw1 - yw1, which appears at the output of an adder SMa, can also be computed once and for all. Error e₁ can then be written e₁ = e₀ + xw2, where xw2 depends on the shift. It is then necessary to determine xw2 for all values of the shift, to compute for each one the corresponding energy of e₁, and to store the value of ĥ that provides the minimum energy together with the corresponding signal xw(n).
  • The procedure adopted according to the invention to determine xw2 takes into account that, for a given shift value, signal xw2 is given by a relation (8) expressing the filtering, with null initial conditions, of the modified residual. The upper limit of the summation in (8) is the minimum between n and P since, when filtering with null initial conditions, samples with n-k < 0, that is, samples of the previous subframe, must not be taken into account. The values of xw2 are actually computed according to (8) for a first group of Γ possible shifts ranging from ĥmax to ĥmax-Γ+1; obviously, the tests are stopped if ĥmin is reached before all Γ shifts have been examined. For the other values of the shift, from ĥmax-Γ to ĥmin, instead of being computed with (8), xw2 is computed according to the equation

     xw2(n) = xw2(n-1) + Q(n)·r̂s(ĥ)   (9)

     (n = Ls-1 ... 1), with xw2(0) = r̂s(ĥ).

  • In (9), Q(n) denotes the truncated pulse response of filters STW (truncated since it is computed only for Ls values of n), with Q(0) = 1.
  • It can be immediately noted that, since Q is determined once and for all, (9) requires, apart from the initial value, far fewer computations than (8).
  • It must further be stated that Γ values of xw2 must actually be computed according to (8) and (9), one for each of the Γ upsampled-signal samples corresponding to an 8-kHz sampling period.
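  • The following C fragment sketches, for illustration only, the fast search for the optimal shift; the stand-in zero_state_filter() plays the role of relation (8), while the inner update is relation (9). Names, the use of C99 variable-length arrays and the indexing of rm_up are assumptions:

     #define GAMMA 8

     /* Stand-in for the direct zero-state filtering (8) of the modified
      * residual read at shift h.                                            */
     extern void zero_state_filter(const double *rm_up, int h, int Ls,
                                   double *xw2);

     static int best_shift(const double *rm_up, const double *e0,
                           const double *Q, int h_max, int h_min, int Ls)
     {
         double xw2[Ls];
         int    h_best = h_max;
         double E_best = 1e300;

         for (int phase = 0; phase < GAMMA; phase++) {
             int h = h_max - phase;
             if (h < h_min) break;                 /* range exhausted        */
             zero_state_filter(rm_up, h, Ls, xw2); /* relation (8)           */

             for (;;) {
                 double E = 0.0;
                 for (int n = 0; n < Ls; n++) {    /* energy of e1 = e0+xw2  */
                     double e1 = e0[n] + xw2[n];
                     E += e1 * e1;
                 }
                 if (E < E_best) { E_best = E; h_best = h; }

                 h -= GAMMA;                       /* same phase, next shift */
                 if (h < h_min) break;
                 for (int n = Ls - 1; n >= 1; n--) /* relation (9)           */
                     xw2[n] = xw2[n - 1] + Q[n] * rm_up[h];
                 xw2[0] = rm_up[h];
             }
         }
         return h_best;
     }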
  • Once the energy of e₁(n) has been minimized and the optimal shift found, the minimization of the energy of e(n) is started to find the optimal excitation. Unit EM directly computes an expression of the energy to be minimized as a function of the positions of the pulses in the innovation word; for this purpose the pulse response Q, computed during the search for the optimal shift, is employed. Using the pulse response is convenient, compared with actually performing the filtering, because every word includes at most two non-null samples. Moreover, considering the more general case of the words with two pulses, the global pulse response is the sum of two responses spaced by a distance equal to the key spacing; the responses for all the other words linked to a key are then obtained simply by a translation of one sample at a time. To simplify the following mathematical expressions, the variability range of the summation index has not been indicated for summations extended to all samples in a subframe.
  • Error e(n), for a generic excitation word, is given by e(n) = e₁(n) - yw2(n) = e₁(n) - g₁·u(n), where u(n) is the output signal of STW1b. The energy of e(n) is given by

     E(e) = Σ e²(n) = Σ [e₁(n) - g₁·u(n)]²   (10)

     which can be written as E(e) = Σe₁² - 2g₁·Σe₁u + g₁²·Σu². Taking into account that the first and the last summations represent the energies of signals e₁, u, and the second one represents the mutual correlation R(e₁u)(k) between them, evaluated for k = 0 and in the following simply called R(e₁u), we have

     E(e) = E(e₁) - 2g₁·R(e₁u) + g₁²·E(u)   (11).

  • Minimizing E(e) is the same as maximizing the difference of energies

     ΔE = E(e₁) - E(e) = 2g₁·R(e₁u) - g₁²·E(u)   (12).

     For each word of the examined codebook, the maximum of (12) is obtained for a value g₀ = R(e₁u)/E(u) of g₁, as immediately appears by computing the derivative with respect to g₁ and setting it equal to 0, and to this value there corresponds

     ΔE₀ = R(e₁u)²/E(u) = g₀·R(e₁u)   (13).
  • The particular structure of the innovation codebook allows E(u) and R(e₁u), which depend on the position of the pulse or pulses in the word, to be obtained directly by exploiting the pulse response of filter STW1, which is equal to that of filter STW2, previously determined.
  • In fact

     E(u) = Σ [Q(n-n₁) ± Q(n-n₂)]² = Σ Q²(n-n₁) + Σ Q²(n-n₂) ± 2 Σ Q(n-n₁)·Q(n-n₂)

     or, more simply,

     E(u) = Eq(n₁) + Eq(n₂) ± ρ(n₁, n₂)   (14)

     where Eq is the energy of the adequately truncated signal Q (that is, computed for a number of samples determined by the position of n₁, n₂). Moreover, R(e₁u) can be written

     R(e₁u) = R[e₁q(n₁)] ± R[e₁q(n₂)]   (15)

     where

     R[e₁q(K)] = Σ e₁(n+K)·Q(n),   the summation being extended from n = 0 to n = Ls-1-K.

     It is clear that for single-pulse words relations (14) and (15) simply reduce to E(u) = Eq(n₁) and R(e₁u) = R[e₁q(n₁)].
  • The operations performed at each subframe by EM to determine the optimal excitation can be considered as divided into three steps.
    • a) Before examining the effect of each innovation word, as soon as values ai are available, EM computes and stores the possible values of the three addends in (14). Computation will be carried out only for the first 4 subframes since, as already said, in the following subframes filter coefficients ai do not change. Terms Eq can be computed with a simple iterative procedure, according to the equation

      Eq(Ls-1-n) = Eq(Ls-n) + Q²(n)   (16)

      with n = 1 ... Ls-1 and Eq(Ls-1) = 1.
      Moreover, since the codebook includes only Ni2 possible pairs of values n₁, n₂, the computation of ρ is carried out only for these pairs, according to the expressions

      ρ(k) = 2·Q[n₂(p)]

      ρ(k) = ρ(k+1) + 2·Q[n+n₂(p)]·Q(n)

      where n₂(p) has the already cited meaning, n = 1 ... Ls-1-n₂(p), and k = Ni2 ... 1 denotes the generic pair of values n₁, n₂; the first expression provides the starting value, from which the second one iteratively obtains the values for the other pairs.
    • b) As soon as the value of e₁ corresponding to the optimal shift is available, still before the search procedure, EM computes and stores the values R(e₁q).
    • c) After these operations, EM computes the values of E(u), R(e₁u) word by word, determining the value g₀ and the related ΔE, and storing the word index and the related value of g that yielded the energy minimum.
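  • Step c) can be illustrated by the C sketch below; the per-word tables (n1, n2 with n2 < 0 for single-pulse words, the second-pulse sign sign2 and the cross-term rho_w) and the function name are assumptions, while the quantities used are those of relations (13), (14) and (15):

     /* Word-by-word maximization of dE0 = R(e1u)^2/E(u); returns the index
      * of the best word and its optimal (signed) gain g0 = R(e1u)/E(u).
      * Single-pulse words are assumed to be present in the tables only when
      * their test is enabled (voiced sound or detected onset).              */
     static int best_word(const double *Eq, const double *Re1q,
                          const int *n1, const int *n2, const int *sign2,
                          const double *rho_w, int nwords, double *g_opt)
     {
         int    best = -1;
         double dE_best = -1e300;

         for (int w = 0; w < nwords; w++) {
             double Eu, R;
             if (n2[w] < 0) {                      /* single-pulse word     */
                 Eu = Eq[n1[w]];
                 R  = Re1q[n1[w]];
             } else {                              /* relations (14), (15)  */
                 Eu = Eq[n1[w]] + Eq[n2[w]] + sign2[w] * rho_w[w];
                 R  = Re1q[n1[w]] + sign2[w] * Re1q[n2[w]];
             }
             if (Eu <= 0.0) continue;
             double g0 = R / Eu;                   /* optimal g1, rel. (13) */
             double dE = g0 * R;                   /* energy decrease dE0   */
             if (dE > dE_best) { dE_best = dE; best = w; *g_opt = g0; }
         }
         return best;
     }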
  • As said above, if the sound is not voiced, the tests with the words of the first part of the codebook are carried out only if strong energy concentrations in short times are noted, which can indicate the onset of a voiced signal section. For this purpose, within the subframe, the energy of a certain group of samples of the modified residual (e.g., 5 samples) is computed, starting from the beginning of the subframe and shifting by one sample at a time the window selecting the group, till the whole subframe has been scanned, and storing which group shows the maximum energy. Furthermore, the average power (that is, the energy divided by the number of samples) in the window where the maximum occurred and the average power in the subframe are also computed. Tests with single-pulse words will be enabled if the subframe energy and the ratio between the average powers in the window and in the subframe are greater than suitable thresholds. Moreover, if the optimal innovation is composed of a single-pulse word, the absolute value of gain g is limited to a maximum value |g|max equal to |rs|max multiplied by a parameter approximately equal to 1, where |rs|max is the residual maximum computed during the operations for determining ĥmin, ĥmax. The purpose of this limitation is to prevent the insertion into the signal of a pulse with too high an energy with respect to the maximum residual amplitude in the same subframe.
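  • The detection of a strong energy concentration can be sketched as follows (illustrative C; the window length W and the thresholds are assumed tuning parameters, the text suggesting W = 5):

     /* Returns 1 if the subframe of the modified residual rm[] shows an
      * energy concentration in a short window, i.e. a possible voiced
      * onset: subframe energy above e_thr and ratio between the average
      * powers in the best window and in the subframe above p_thr.           */
     static int onset_detected(const double *rm, int Ls,
                               int W, double e_thr, double p_thr)
     {
         double e_sub = 0.0;
         for (int n = 0; n < Ls; n++) e_sub += rm[n] * rm[n];
         if (e_sub <= e_thr) return 0;

         double e_win = 0.0, e_win_max;
         for (int n = 0; n < W; n++) e_win += rm[n] * rm[n];
         e_win_max = e_win;
         for (int n = W; n < Ls; n++) {              /* slide by one sample */
             e_win += rm[n] * rm[n] - rm[n - W] * rm[n - W];
             if (e_win > e_win_max) e_win_max = e_win;
         }

         return (e_win_max / W) / (e_sub / Ls) > p_thr;
     }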
  • At the end of each subframe, initial conditions in filters LTSa, STW1a, STW2a will have to be updated. To update LTSa, that is ss(n), it will be necessary to add a pulse or a pair of pulses (corresponding to the optimal innovation word) to ss1(n). To update yw(n), it will be necessary to add to yw1(n) one or two pulse responses (corresponding to signal u(n)) adequately shifted and multiplied by gain g in order to supply the value of yw2 corresponding to the optimal excitation. The pulse response will also be exploited to update STW2a. Furthermore, since filters STW have order P, only the last P samples of such responses (from Ls to Ls-P) are of interest.
       The operations of EM are also included in the appendix.
  • The decoder structure will now be described, referring to the diagram in Fig. 3, where blocks corresponding to the ones already described with reference to Fig. 1 are shown by the same reference symbols, followed by digit 2. The various reconstructed signals are also shown with the same reference symbols used for the original signals in the coder.
  • The decoder receives from the coder, through connections 2a-2e, indexes j(φ), j(d), j(b), j(s), j(gmax), j(gnor) and the sign σ of the innovation gain. At each subframe, index j(s) selects an innovation word s(n) in codebook IC2 or indicates a subframe that provides no innovation contribution (g = 0). If a word has been selected, it is multiplied in M2 by the gain g whose absolute value is selected in codebook IG2 by an index j(g) = j(gmax) - j(gnor) and whose sign is σ, thereby providing the reconstructed excitation signal (or fixed codebook contribution) s₁(n).
  • This signal is filtered in the long-term synthesis filter LTS2 to provide the reconstructed short-term residual ss(n). In order to operate exactly like its replica LTS1 in the coder, filter LTS2 must receive from reconstruction circuit LTR2 the parameters d, b and the flag F indicating the possible need to carry out the interpolation of d and b. Therefore, LTR2 will include a read-only memory with two tables addressed by indexes j(d), j(b), like LTR1 (Fig. 1), in addition to a circuit suitable for storing the values of d, b related to two consecutive frames and for carrying out the comparisons, described in connection with the coder, necessary to determine whether interpolation of d, b is needed. Signal ss(n) coming out of LTS2 is filtered in the short-term synthesis filter STS2 using coefficients ai generated in coefficient reconstructing circuit STR2 starting from indexes j(φ). In STS2, too, interpolated coefficients are used for the first subframes of each frame. The reconstructed speech signal y(n) is then subjected to a further filtering in an adaptive filter PF, which uses coefficients obtained from the linear prediction coefficients ai and introduces into the reconstructed speech signal a distortion that improves the perceptual effect. At the output of PF a filtered reconstructed signal yp(n) is obtained. The use of filters like PF in speech signal coding is well known to those skilled in the art and requires no further explanation.
  • It will be noted that the decoder does not take into account the possible shift carried out in the coder: in fact, the purpose of the shift is simply to make the synthesized signal as faithful a replica of the original signal as possible, and therefore the decoder only requires the information related to the excitation and to the filters.
  • It is clear that what has been described is provided only by way of non-limiting example and that variations and modifications are possible without departing from the scope of the present invention. Thus, for example, even if reference has been made, as regards the innovation, to samples whose amplitude is 1, it is also possible to use samples whose amplitudes are chosen in a finite set of values (e.g., ±1, ±√2, ±1/√2): obviously, in this case the coded signal will also include information about the relative amplitudes of the innovation samples. Generalizing equations (14), (15) to the case of pulses with non-unitary amplitude is immediate. The choice of sample amplitudes in a finite set of values is not limiting, because the relative amplitudes of the samples are in any case quantized.
  • To simplify the drawings, no timing signals for the various blocks have been shown; on the other hand, the timing sequence of operations clearly results from the description.
     [Appendix: program listing in C language, reproduced in the original document as Figures imgb0053 to imgb0064]

Claims (32)

  1. Method of coding/decoding speech signals, including, in a coding step, the operations of:
    - sampling the original speech signals at a first sampling rate and dividing the resulting sequence of samples [x(n)] into a plurality of blocks of subsequent samples, each block comprising a first predetermined number Ls of samples or an integer multiple of said first number;
    - performing a short-term analysis of the original speech signal to determine a group of linear prediction coefficients (ai) to be used for a linear prediction filtering, a short-term synthesis filtering and a spectral weighting filtering, generating a representation of said coefficients in the frequency domain, and inserting into the coded signal information [j(φ)] related to the value of said representation, said information being valid for a period equal to the duration of a block or of a group of consecutive blocks of samples;
    - obtaining, through said linear prediction filtering, a short-term residual signal [rs(n)] for said block or group of blocks of samples;
    - subjecting said residual signal [rs(n)] to a long-term analysis, to determine long-term analysis parameters comprising a long-term synthesis filtering delay d and coefficient b, and inserting into the coded signal information [j(d), j(b)] related to the values of said parameters, said information being valid for a time equal to the duration of a block or a group of consecutive blocks of samples;
    - reproducing every block of speech signal samples to be coded with a reconstructed and weighted speech signal [yw(n)], obtained by subjecting to long-term synthesis filtering, short-term synthesis filtering and spectral weighting filtering an excitation signal chosen within a set of excitation signals, each comprising an amplitude contribution (excitation gain) and a shape contribution (innovation), the latter being composed of a limited number of pulses, much less than said first number of samples, with predefined positions and amplitudes belonging to a respective finite set;
    - subjecting a set of samples of said residual signal [rs(n)] to a time shift by discrete steps, each set of residual signal samples having a number of samples equal to the number of samples in a block of speech signal samples to be coded, to align in time the residual signal with a reconstructed residual signal [ss(n)] obtained as result of the short-term synthesis filtering of an excitation signal, the shift generating a modified residual signal [r̂m(n̂)] that is subjected to a long-term synthesis filtering and to a spectral weighting filtering, identical to those carried out for the excitation signals, to generate a reconstructed and weighted modified speech signal [xw(n)];
    - determining an optimal excitation signal for each block of samples, by minimizing the energy of a weighted error signal [e(n)] represented by the difference between the reconstructed and weighted modified signal [xw(n)] and the reconstructed and weighted signal [yw(n)], and inserting into the coded signal information [j(s), j(gmax), j(gnor), σ] that identifies the optimal excitation signal; characterized in that:
    - the innovation pulses are the only non-null samples of words composed of said first number Ls of samples,
    - the innovation words for a first subset of excitation signals include a pair of pulses, a limited group of words of the first set being key-words in which the two pulses are placed in predetermined key positions and the other words in the subset being obtained from each of the key-words by each time simultaneously shifting the pulses by one position towards a word end, till one of the pulses reaches said end or the key position of the other pulse in the starting word, the shifting direction being the same for all words; and
    - the innovation words for a second subset of excitation signals include only one pulse whose position is different for each signal;
       and in that for said determination of the optimal excitation signal the energy of said weighted error signal is directly computed, by exploiting a pulse response Q(n) of filters that carry out synthesis and spectral weighting filterings of the excitation signal, with the following operations:
    - determining said pulse response Q(n) and the energy Eq thereof for each of the possible pulse positions in the excitation signals;
    - determining a first partial error signal [e₁(n)], represented by the difference between the reconstructed and weighted signal [xw(n)] and a contribution [yw1(n)] of the excitation signal filtering memory, and the energy of the same error signal;
    - determining a first correlation R(e₁q) between said first partial error signal [e₁(n)] and the pulse response Q(n) for each of the pulses of an excitation signal;
    - determining for each excitation signal, starting from said pulse response, a signal [u(n)] representative of a contribution of the filtering with null initial conditions of the excitation signal;
    - determining the energy E(u) of said signal [u(n)] representative of the contribution of a filtering with null initial conditions of the excitation signal, and determining a second correlation R(e₁u) between said signal [u(n)] representative of the contribution of the filtering with null initial conditions of the excitation signal and the first partial error signal [e₁(n)];
    - determining, for each excitation signal, an optimal value of the amplitude contribution as ratio between said second correlation and the energy of the signal resulting from filtering at null initial conditions;
    - computing, as a function of said second correlation R(e₁u), of said energy E(u) of the signal representative of the contribution of the filtering with null initial conditions of the excitation and of said energy E(e₁) of the first partial error signal, the value of the error signal energy for each excitation signal.
  2. Method according to claim 1, characterized in that said pulses have unitary amplitude.
  3. Method according to claim 1 or 2, wherein the sequence of speech signal samples is divided into frames that are composed by a plurality of consecutive subframes each corresponding to one of said blocks and include a second predetermined number Lf of samples, and wherein said short-term analysis is carried out for each frame, characterized in that for said short-term analysis in a frame a sample window is analysed, whose length is Lf+P (P = number of linear prediction coefficients in each group), that encompasses a current frame and the subsequent frame and also includes a predefined number H+K of samples of said subsequent frame, said window being a trapezoidal window that weights all samples with maximum weight, apart from the first and the last P samples, for which the weighting factors are determined through linear interpolation between a minimum weight and the maximum weight.
  4. Method according to claim 3, characterized in that for the initial subframes of each frame, the linear prediction coefficients ai are coefficients obtained as result of an interpolation between the values provided by short-term analysis for the current frame and those provided for the previous frame, the interpolation being carried out by operating on said representation.
  5. Method according to any one of the previous claims, wherein the linear prediction residual is subjected to low-pass filtering before long-term analysis, thereby providing a filtered residual signal [rf(n)].
  6. Method according to any of claims 1 to 5, wherein the sequence of speech signal samples is divided into frames that are composed of a plurality of consecutive subframes each corresponding to one of said blocks and include a second predetermined number Lf of samples, and wherein said long-term analysis is carried out for each frame, characterized in that to determine said long-term analysis parameters, a sample window of the filtered residual signal [rf(n)] is analysed, that encompasses a current frame and the subsequent frame and also includes a predefined number H+K of samples of said subsequent frame.
  7. Method according to claim 6, characterized in that said long-term analysis further includes the operation of determining, for each frame, a long-term prediction gain G, representative of the ratio between the energies of filtered residual signal at the input of and at the output from means that carry out said analysis, the gain being also determined at each frame.
  8. Method according to claim 6 or 7, characterized in that said long-term analysis further includes the operations of:
    - classifying a speech signal segment corresponding to a frame as voiced or unvoiced, depending on the value of said long-term analysis coefficient b and on prediction gain G, and generating a first flag (V) in case the segment is classified as voiced;
    - comparing values of long-term analysis delay d and coefficient b related to a current frame with those related to the previous frame and generating, when delay variation is less than a predefined amount and coefficient values in both frames are positive, a second flag (F) that enables interpolation between delay and coefficient values computed for said previous frame and those computed for the current frame.
  9. Method according to any of the claims from 6 to 8, wherein long-term analysis delay d is determined as maximum of the autocorrelation function of the filtered residual within the window used for the analysis itself, characterized in that, before determining long-term analysis coefficient b and prediction gain G for the current frame, the local maximum of said autocorrelation function is determined even in a neighborhood of the maximum of the same function in the previous frame, if said first and second flags had been generated in said previous frame, and said local maximum is used as delay for current frame if it is different by an amount that is less than a predefined value from the maximum in the window related to current frame.
  10. Method according to any of the claims from 6 to 9, characterized in that the value of long-term analysis coefficient b is clipped to a first maximum value b₁, linked to the ratio between energy of the filtered residual signal in the current frame and in the previous frame in an interval whose length is equal to the long-term analysis delay.
  11. Method according to any of the claims from 6 to 10, characterized in that the value of long-term analysis coefficient b is clipped to a second maximum value b₂, if it exceeds such value while the prediction gain G is less than a gain threshold Gthr.
  12. Method according to claim 8 or any of claims 9 to 11, if referred to claim 8, characterized in that said interpolation of long-term analysis delay d and coefficient b is a linear interpolation extended over a whole frame and, in case of a non-integer interpolated delay value, the value of a corresponding sample of the reconstructed residual signal ss(n) is evaluated with a second-order polynomial interpolation centered around the integer delay value that is nearest to said interpolated value.
  13. Method according to any of the claims from 6 to 12, wherein information related to long-term analysis coefficient b inserted in the coded signal are indexes representative of quantized coefficient values, and information related to long-term analysis delay d allows representing also delay values that are outside an interval of allowed delays, characterized in that coefficient values that are less than a predefined fraction of a minimum quantized value are forced to 0 and, in case of forcing to 0, delay information representative of a value that is outside said interval of allowed delays and the index representative of said minimum quantized value, are inserted in the coded signal.
  14. Method according to any of claims 1 to 13, characterized in that, to determine the optimal excitation, excitation signals of said second subset are used if said first flag (V) has been generated or, if said flag has not been generated, if analysis of the energy distribution in the modified residual signal shows an energy concentration in short times, that indicates the onset of a voiced sound.
  15. Method according to claim 14, characterized in that, to determine the optimal excitation, the excitation signals of the two subsets are normalized with different normalization factors, linked to the number of pulses present in respective subset signals.
  16. Method according to claim 14 or 15, characterized in that, if said first flag (V) has been generated, the amplitude contribution for excitation signals of said second subset is limited in such a way as not to exceed a threshold that is proportional to the absolute value of the residual signal.
  17. Method according to any of claims 14 to 16, characterized in that said analysis of the energy distribution of the modified residual signal is carried out at each subframe and includes the operations of:
    - dividing the subframe into a plurality of partially overlapping windows, a first and a last window coinciding with a respective initial or final part of the subframe, the windows following the first one being each shifted by one sample with respect to the previous window;
    - determining the energy and the power of the modified residual signal in the whole subframe and the energy in each one of said windows;
    - determining the power for the window whose energy is maximum and determining the ratio between the power in said window and the power in the subframe; and
    - comparing said maximum energy and said power ratio with respective thresholds, said energy concentration being recognized if said maximum energy and said ratio are not less than respective thresholds.
  18. Method according to any of the claims from 6 to 17, characterized in that, if only the second flag (F) has been generated, long-term analysis delay d is varied by an amount that is proportional to the amount of the shift accumulated up to the previous frame, the absolute value of the variation being limited to a predefined maximum.
  19. Method according to claim 18, characterized in that said delay variation is disabled if it causes the decision about interpolation to be altered and the delay to go out of a predetermined interval of values.
  20. Method according to any of the claims from 6 to 19, characterized in that the residual signal is subjected to said time shift in a subframe if at least one of said first and second flags has been generated and if an analysis of the modified residual signal energy in the subframe shows that the corresponding speech signal segment is not silence and includes a pitch peak, the shift related to a subframe being accumulated with that of the previous subframes of the same frame, so that the total shift in a frame remains less than a maximum shift.
  21. Method according to claim 20, characterized in that said analysis of the modified residual signal energy includes the operations of:
    - comparing the energy itself with an energy threshold, which, when reached, shows that the corresponding speech signal segment is not silence;
    - determining the modified residual signal power in the subframe and in an interval whose length is equal to the long-term analysis delay, and the ratio between such powers; and
    - comparing such ratio with a power threshold, which, when exceeded, shows the presence of a pitch peak in the subframe.
  22. Method according to claim 20 or 21, characterized in that the shift for a subframe is determined, before determining an optimal excitation signal, within an interval that extends around the shift accumulated in previous subframes of the same frame, and it is the value that minimizes energy of said first partial error signal [e₁(n)].
  23. Method according to claim 20, characterized in that to determine the shift, an upsampling of the residual signal is carried out, at a second rate that is a multiple of the first rate, the shift in a subframe being equal to one or more samples of the upsampled residual signal.
  24. Method according to claim 22 or 23, characterized in that said first partial error signal is computed as sum between a signal [xw2(n)] representative of the modified residual signal filtered with null initial conditions and a second partial error signal [e₀(n)], which is the difference between the memory contribution [xw1(n)] of the modified residual signal filtering and the memory contribution [yw1(n)] of the excitation filtering, the signal [xw2(n)] representative of the modified residual filtered with null initial conditions related to a sample in a subframe being obtained by carrying out the actual filtering of the modified residual signal for shift values between the upper end of the interval and an intermediate value between the two extreme values, while for each of the remaining shifts in the interval it is iteratively obtained from the value related to the previous sample and from said pulse response.
  25. Method according to claim 24, characterized in that the determination of said interval of shift values is carried out through the following operations:
    - fixing for the interval ends two symmetrical values with respect to the accumulated value;
    - determining the residual signal peak position in the upsampled residual signal and comparing it with the peak position in the previous subframe;
    - limiting the interval extension on one or both sides of the accumulated value to avoid an excessive shift of the subframe into the past and/or into the future, with consequent duplication or loss of residual signal peaks.
  26. Method according to claim 25, characterized in that, in case of interval limitation on one side only of the accumulated value, the search for the shift is carried out also taking into account a certain number of values beyond the interval end not affected by the limitation, such that the global number of tested values is equal to the number of values included between said symmetrical values.
  27. Method according to any of the claims from 1 to 26, including a decoding step where, starting from the information [j(φ), j(d), j(b), j(s), j(gnor), j(gmax), σ] about the linear prediction coefficient representation, the long-term analysis parameters and the excitation signal, said representation is reconstructed, reconstructed linear prediction coefficients are obtained therefrom, the long-term analysis parameters are reconstructed, an excitation signal is chosen in a set of excitation signals corresponding to the one used in the coding step, and said signal is subjected to a short-term and a long-term synthesis filtering, identical to the ones carried out in the coding step, by using reconstructed linear prediction coefficients ai and long-term analysis delay d and coefficient b, to generate a reconstructed block of speech signal samples [y(n)] for each excitation signal [s(n)], characterized in that every block of the reconstructed speech signal [y(n)], during the initial part of a validity period of linear prediction coefficients, is generated by carrying out the short-term synthesis filtering with reconstructed linear prediction coefficients ai obtained as result of an interpolation between reconstructed values related to an immediately previous validity period and reconstructed values related to the current period, and in that the values of long term analysis delay d and coefficient b, related to two consecutive validity periods, are compared and, if the delay variation is less than a predefined amount and the coefficient is positive in both periods, a flag corresponding to that second flag is generated, to enable carrying out, during long-term synthesis filtering, an interpolation between the long-term analysis parameter values related to said two validity periods.
  28. Apparatus for coding/decoding speech signals using analysis-by-synthesis techniques, including a coder composed of:
    - means (MT) for sampling a speech signal at a first rate and for dividing the sample sequence into blocks comprising a first number of samples;
    - short-term analysis means (STA, STR1) for computing a group of linear prediction coefficients ai for one or more blocks of samples, for transforming said coefficients into a representation thereof in the frequency domain, for obtaining from said representation indexes j(φ) identifying the coefficients themselves, to be inserted into the coded signal, and for reconstructing the coefficients starting from said indexes, every group of linear prediction coefficients being valid for a period of time equal to the duration of one or more blocks of samples;
    - a linear prediction filter (LPC) that receives blocks of signal samples from the sampling means (MT) and linear prediction coefficients ai from the short-term analysis means (STA, STR1) and generates a short-term prediction residual signal rs(n);
    - long-term analysis means (LTA, LTR1) for obtaining, from said residual signal, parameters for a long-term synthesis filtering, which parameters comprise a delay (d) and a coefficient (b), and for transforming said parameters into indexes [j(b), j(d)] to be inserted into the coded signal, the long-term analysis parameters being valid for a period of time equal to the duration of one or more blocks of samples;
    - a first filtering system (LTS1, STS1, SW) that: includes the series of a long-term synthesis filter (LTS1), which receives said parameters from the long-term analysis means (LTA, LTR1), and of a short-term synthesis filter (STS1) and a spectral weighting filter (SW), which receive said linear prediction coefficients ai from said short-term analysis means (STA, STR1); receives signals belonging to a set of excitation signals, each including a shape contribution composed of a number of pulses of predefined amplitudes and positions, said pulse number being much less than said first number; and generates a reconstructed signal yw(n) for each one of the excitation signals;
    - means (TS) for time shifting, by discrete steps, a set of samples yw(n) of said residual signal to align it in time with a reconstructed residual signal ss(n) generated by the long-term synthesis filter (LTS1) of said first filtering system, the set of samples of residual signal having a number of samples equal to said first number of samples, every shift step being chosen within an interval of allowed values;
    - a second filtering system (STS', SW'), that includes the series of a short-term synthesis filter and a spectral weighting filter identical to those (STS1, SW) of the first filtering system, is supplied with a modified residual signal generated by the time shift means for each of the values of said interval, and generates a reconstructed and weighted modified residual signal, said first and second filtering systems (LTS1, STS1, SW, STS', SW') separately determining a contribution representative of the memory of previous filtering and a contribution representative of a filtering with null initial conditions;
    - means (SM, EM) for generating a weighted error signal [e(n)] by comparing signals generated by the first and the second filtering systems, for identifying an optimal excitation signal and an optimal shift, by minimizing the energy of said weighted error signal, and for inserting in the coded signal information that identifies the optimal excitation signal;
       and further comprising, at the decoding side:
    - means (LTR2, STR2) for reconstructing the linear prediction coefficients and long-term analysis parameters starting from said indexes;
    - a third filtering system (LTS2, STS2), including the series of a long-term synthesis filter and a short-term synthesis filter, identical to those (LTS1, STS1) of the first filtering system, for filtering an excitation signal selected, through information related to the optimal excitation, in a set corresponding to the set used on the coding side and for generating a block of reconstructed speech signal samples,
    characterized in that:
    - the innovation pulses are the only non-null samples of words composed of said first number Ls of samples,
    - the innovation words for a first subset of excitation signals include a pair of pulses, a limited group of words of the first subset being key-words in which the two pulses are placed in predetermined key positions, and the other words in the subset being obtained from each of the key-words by simultaneously shifting both pulses, one position at a time, towards a word end, until one of the pulses reaches said end or the key position of the other pulse in the starting word, the shifting direction being the same for all words; and
    - the innovation words for a second subset of excitation signals include only one pulse whose position is different for each signal;
       and in that, in said error signal generating means (SM, EM), the means to minimize error energy are composed of a processing unit arranged to:
    - determine said pulse response [Q(n)] and an energy (Eq) thereof for each one of the possible pulse positions in excitation signals;
    - determine a first partial error signal [e₁(n)], represented by the difference between the reconstructed and weighted modified signal [xw(n)] and a contribution [yw1(n)] of the excitation signal filtering memory, and an energy of the error signal itself;
    - determine a first correlation [R(e₁q)] between said first partial error signal [e₁(n)] and the pulse response for each of the pulses of an excitation signal;
    - determine, for each excitation signal, starting from said pulse responses, a signal [u(n)] representative of a contribution of the filtering with null initial conditions of the excitation signal;
    - determine the energy [E(u)] of said signal [u(n)] representative of the contribution of a filtering with null initial conditions of the excitation signal and a second correlation R(e₁u) between said signal [u(n)] and the first partial error signal [e₁(n)];
    - determine, for each excitation signal, an optimal value of the amplitude contribution as the ratio between said second correlation and the energy of the signal resulting from the filtering with null initial conditions;
    - compute, as a function of said second correlation R(e₁u), of said energy [E(u)] of the signal representative of the contribution of the filtering with null initial conditions of the excitation, and of said energy [E(e₁)] of the first partial error signal, the error signal energy value for each excitation signal.
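Claim 28's innovation structure and the listed minimisation steps can be combined in a short sketch. Everything below is illustrative: the word length, the key positions, the unit pulse amplitudes and the shifting rule's exact stop condition are assumptions, and the per-pulse correlations of the claim are folded into a single correlation R(e₁, u). What the sketch does preserve is the standard criterion: once the optimal amplitude g = R(e₁,u)/E(u) is substituted, the error energy becomes E(e₁) − R(e₁,u)²/E(u), so the search reduces to minimising that quantity over the excitation words.

```python
# Illustrative sketch (assumed codebook layout and names) of the innovation
# subsets and of the error-energy minimisation listed in claim 28.
import numpy as np

def two_pulse_subset(Ls, key_words):
    """First subset: pairs of pulses.  Each key-word gives two key positions;
    further words are obtained by shifting both pulses together towards the
    word end (here: the right end) until one pulse reaches that end or the
    key position of the other pulse in the starting word (one plausible
    reading of the claim's shifting rule)."""
    words = []
    for p1, p2 in key_words:
        lo, hi = min(p1, p2), max(p1, p2)
        shift = 0
        while hi + shift < Ls:
            words.append((lo + shift, hi + shift))
            if hi + shift == Ls - 1 or lo + shift == hi:
                break
            shift += 1
    return words

def one_pulse_subset(Ls):
    """Second subset: a single pulse, at a different position in each word."""
    return [(p,) for p in range(Ls)]

def search_innovation(e1, h, words):
    """Pick the excitation word minimising the weighted error energy.

    e1     first partial error signal (target after removing the filter memory)
    h      impulse response of the cascaded synthesis and weighting filters
    words  tuples of pulse positions, unit amplitudes assumed
    """
    Ls = len(e1)
    hh = np.zeros(Ls)
    hh[:min(len(h), Ls)] = h[:Ls]               # truncate/pad h to Ls samples
    E_e1 = float(np.dot(e1, e1))
    q = np.zeros((Ls, Ls))                      # zero-state response of a pulse
    for p in range(Ls):                         # at every possible position
        q[p, p:] = hh[:Ls - p]
    best = None
    for w in words:
        u = q[list(w)].sum(axis=0)              # u(n): zero-state response of the word
        Eu = float(np.dot(u, u))                # energy E(u)
        R = float(np.dot(e1, u))                # correlation R(e1, u)
        g = R / Eu                              # optimal amplitude contribution
        err = E_e1 - R * R / Eu                 # error energy at the optimal gain
        if best is None or err < best[0]:
            best = (err, w, g)
    return best                                 # (error energy, pulse positions, gain)

# example use: 40-sample words, two key-words plus the single-pulse subset
words = two_pulse_subset(40, [(0, 7), (3, 21)]) + one_pulse_subset(40)
```

In the claim, the pulse responses Q(n), their energies Eq and the per-pulse correlations R(e₁q) are determined once per subframe and then combined per excitation word; the table q in the sketch plays that role, with the per-pulse correlations folded into the single product R(e₁, u).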
  29. Apparatus according to claim 28, characterized in that a low-pass filter (FPB) is provided between said linear prediction filter (LPC) and said long-term analysis means (LTA, LTR1).
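As a reminder of what the low-pass filter of claim 29 does, the following lines apply a generic FIR smoothing to the short-term residual before the long-term (pitch) analysis; the 3-tap kernel is an arbitrary choice for the example, not the patent's FPB filter.

```python
# Generic low-pass smoothing of the residual before long-term analysis
# (illustrative kernel only, not the patent's FPB filter).
import numpy as np

def lowpass_residual(rs, kernel=(0.25, 0.5, 0.25)):
    return np.convolve(rs, kernel, mode="same")
```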
  30. Apparatus according to claim 28 or 29, characterized in that the short-term analysis means (STA, STR1) in the coder and the means (STR2) for reconstructing linear prediction coefficients in the decoder include means for carrying out, on said representation in the frequency domain, a linear interpolation between values related to two consecutive validity periods and supply the short-term synthesis filters (STS1, STS', STS2) of said filtering systems with the interpolated values in an initial part of a validity period of a set of coefficients.
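Claim 30's interpolation operates on the frequency-domain representation (for instance LSF-like parameters) of two consecutive validity periods and feeds the interpolated values to the short-term filters in the initial part of the current period. The subframe count and the weighting law below are assumptions made for the sketch.

```python
# Illustrative per-subframe interpolation of the frequency-domain
# representation between two consecutive validity periods (claim 30).
# The number of subframes and the weighting law are assumptions.
def subframe_representations(prev_repr, cur_repr, n_subframes=4):
    out = []
    for k in range(n_subframes):
        alpha = (k + 1) / n_subframes        # assumed weighting: 0.25, 0.5, 0.75, 1.0
        out.append([(1.0 - alpha) * p + alpha * c
                    for p, c in zip(prev_repr, cur_repr)])
    return out
```

Each interpolated vector would then be converted back to prediction coefficients ai before reaching the synthesis and weighting filters, as the claims describe.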
  31. Apparatus according to any one of claims from 28 to 30, characterized in that the long-term analysis means (LTA, LTR1) in the coder and the means (LTR2) for reconstructing the long-term analysis parameters in the decoder include comparing means for comparing parameters related to two consecutive validity periods and generating a flag (F) to enable carrying out an interpolation between the parameters when they satisfy predetermined conditions, and the long-term synthesis filters (LTS1, LTS2) of the first and second filtering systems are associated to means that, when said flag is present, carry out a second-order polynomial interpolation of said parameters, extended to a whole validity period thereof, and supply the respective long-term synthesis filter (LTS1, LTS2) with the interpolated parameters.
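Claim 31 asks for a second-order polynomial interpolation of the long-term parameters over a whole validity period when the flag is set. The claim does not spell out here which support points define the parabola, so the sketch below assumes the parameter values of three consecutive validity periods and evaluates the quadratic at each subframe of the current period.

```python
# Illustrative second-order interpolation of the long-term parameters
# (claim 31); the choice of three consecutive frame values as support points
# and the subframe layout are assumptions made for the example.
def quadratic_interp(v0, v1, v2, t):
    """Lagrange parabola through (0, v0), (1, v1), (2, v2), evaluated at t."""
    return (v0 * (t - 1.0) * (t - 2.0) / 2.0
            - v1 * t * (t - 2.0)
            + v2 * t * (t - 1.0) / 2.0)

def ltp_per_subframe(d_vals, b_vals, n_subframes=4, flag=True):
    """d_vals, b_vals: delay and coefficient of the last three validity
    periods.  With the flag set, spread them smoothly over the current
    period; otherwise simply hold the newest values."""
    if not flag:
        return [(d_vals[-1], b_vals[-1])] * n_subframes
    out = []
    for k in range(1, n_subframes + 1):
        t = 1.0 + k / n_subframes            # positions inside the current period
        out.append((quadratic_interp(*d_vals, t), quadratic_interp(*b_vals, t)))
    return out
```

The interpolated delay would typically be rounded to the coder's delay resolution before it drives the long-term synthesis filter.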
  32. Apparatus according to any one of claims from 28 to 31, characterized in that the time shift means (TS) include a circuit (US) for upsampling the residual signal, and storing means (SH) for storing, for each block of samples to be coded, a first group of upsampled residual signal samples corresponding to said first number Ls of samples, and two further groups of upsampled residual signal samples, respectively preceding and following said first group and including a number of samples linked to the maximum allowed shift, and for supplying the second filtering system (STS', SW'), upon command by the energy minimizing means (EM), with a fourth group of upsampled residual signal samples, including as many samples as those of the first group and shifted with respect to the first group by said optimal shift.
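Claim 32's buffering can be pictured as an upsampled residual stored with margins on both sides, from which the energy-minimisation stage later extracts the window displaced by the optimal shift. The linear-interpolation upsampler, the class layout and the sample counts below are assumptions made for the sketch.

```python
# Illustrative sketch of the upsampling and shift buffering of claim 32
# (upsampler, class layout and sample counts are assumptions).
import numpy as np

def upsample(r, U):
    """Upsample by an integer factor U with linear interpolation."""
    n = np.arange(len(r), dtype=float)
    return np.interp(np.arange(len(r) * U) / U, n, r)

class ShiftBuffer:
    def __init__(self, n_samples, max_shift):
        self.n = n_samples                 # samples of the first group
        self.m = max_shift                 # margin linked to the maximum shift
        self.buf = np.zeros(n_samples + 2 * max_shift)

    def store(self, past, current, future):
        """past / future: margins of max_shift upsampled samples on either
        side of the current block's upsampled samples."""
        self.buf = np.concatenate([past, current, future])

    def window(self, shift):
        """Fourth group of claim 32: as many samples as the first group,
        displaced by 'shift' (negative values reach into the past)."""
        start = self.m + shift
        return self.buf[start:start + self.n]

# example: upsample a block by 4 and read the window at the optimal shift
r_up = upsample(np.random.randn(60), 4)          # 240 upsampled samples
sb = ShiftBuffer(n_samples=160, max_shift=40)
sb.store(r_up[:40], r_up[40:200], r_up[200:240])
x_shifted = sb.window(-3)
```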
EP94105438A 1993-04-09 1994-04-07 Speech coder employing analysis-by-synthesis techniques with a pulse excitation Withdrawn EP0619574A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT93TO000244A IT1264766B1 (en) 1993-04-09 1993-04-09 VOICE CODER USING PULSE EXCITATION ANALYSIS TECHNIQUES.
ITTO930244 1993-04-09

Publications (1)

Publication Number Publication Date
EP0619574A1 true EP0619574A1 (en) 1994-10-12

Family

ID=11411368

Family Applications (1)

Application Number Title Priority Date Filing Date
EP94105438A Withdrawn EP0619574A1 (en) 1993-04-09 1994-04-07 Speech coder employing analysis-by-synthesis techniques with a pulse excitation

Country Status (5)

Country Link
EP (1) EP0619574A1 (en)
JP (1) JPH075899A (en)
CA (1) CA2120902A1 (en)
FI (1) FI941648A (en)
IT (1) IT1264766B1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6334648B1 (en) 1997-03-21 2002-01-01 Girsberger Holding Ag Vehicle seat
US7236928B2 (en) 2001-12-19 2007-06-26 Ntt Docomo, Inc. Joint optimization of speech excitation and filter parameters
EP1870880B1 (en) 2006-06-19 2010-04-07 Sharp Kabushiki Kaisha Signal processing method, signal processing apparatus and recording medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0195487A1 (en) * 1985-03-22 1986-09-24 Koninklijke Philips Electronics N.V. Multi-pulse excitation linear-predictive speech coder
US4890328A (en) * 1985-08-28 1989-12-26 American Telephone And Telegraph Company Voice synthesis utilizing multi-level filter excitation
US5293449A (en) * 1990-11-23 1994-03-08 Comsat Corporation Analysis-by-synthesis 2,4 kbps linear predictive speech codec

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5899968A (en) * 1995-01-06 1999-05-04 Matra Corporation Speech coding method using synthesis analysis using iterative calculation of excitation weights
WO1996021220A1 (en) * 1995-01-06 1996-07-11 Matra Communication Speech coding method using synthesis analysis
FR2729246A1 (en) * 1995-01-06 1996-07-12 Matra Communication SYNTHETIC ANALYSIS-SPEECH CODING METHOD
FR2729247A1 (en) * 1995-01-06 1996-07-12 Matra Communication SYNTHETIC ANALYSIS-SPEECH CODING METHOD
WO1996021218A1 (en) * 1995-01-06 1996-07-11 Matra Communication Speech coding method using synthesis analysis
US5974377A (en) * 1995-01-06 1999-10-26 Matra Communication Analysis-by-synthesis speech coding method with open-loop and closed-loop search of a long-term prediction delay
AU697892B2 (en) * 1995-01-06 1998-10-22 Matra Communication Analysis-by-synthesis speech coding method
US5963898A (en) * 1995-01-06 1999-10-05 Matra Communications Analysis-by-synthesis speech coding method with truncation of the impulse response of a perceptual weighting filter
AU704229B2 (en) * 1995-01-06 1999-04-15 Matra Communication Analysis-by-synthesis speech coding method
EP0766231A2 (en) * 1995-09-29 1997-04-02 Rockwell International Corporation Spike code-excited linear prediction
EP0766231A3 (en) * 1995-09-29 1998-06-17 Rockwell International Corporation Spike code-excited linear prediction
EP0858069A1 (en) * 1996-08-02 1998-08-12 Matsushita Electric Industrial Co., Ltd. Voice encoder, voice decoder, recording medium on which program for realizing voice encoding/decoding is recorded and mobile communication apparatus
EP1553564A2 (en) * 1996-08-02 2005-07-13 Matsushita Electric Industrial Co., Ltd. Voice encoding device, voice decoding device, recording medium for recording program for realizing voice encoding /decoding and mobile communication device
EP0858069A4 (en) * 1996-08-02 2000-08-23 Matsushita Electric Ind Co Ltd Voice encoder, voice decoder, recording medium on which program for realizing voice encoding/decoding is recorded and mobile communication apparatus
EP1553564A3 (en) * 1996-08-02 2005-10-19 Matsushita Electric Industrial Co., Ltd. Voice encoding device, voice decoding device, recording medium for recording program for realizing voice encoding /decoding and mobile communication device
WO1998035341A2 (en) * 1997-02-10 1998-08-13 Koninklijke Philips Electronics N.V. Transmission system for transmitting speech signals
WO1998035341A3 (en) * 1997-02-10 1998-11-12 Koninkl Philips Electronics Nv Transmission system for transmitting speech signals
WO2002099787A1 (en) * 2001-06-04 2002-12-12 Qualcomm Incorporated Fast code-vector searching
US6766289B2 (en) 2001-06-04 2004-07-20 Qualcomm Incorporated Fast code-vector searching
CN1306473C (en) * 2001-06-04 2007-03-21 高通股份有限公司 Fast code-vector searching
KR100935174B1 (en) * 2001-06-04 2010-01-06 콸콤 인코포레이티드 Fast code-vector searching
CN102194462A (en) * 2006-03-10 2011-09-21 松下电器产业株式会社 Fixed codebook searching apparatus
CN102194462B (en) * 2006-03-10 2013-02-27 松下电器产业株式会社 Fixed codebook searching apparatus
WO2010058931A2 (en) * 2008-11-14 2010-05-27 Lg Electronics Inc. A method and an apparatus for processing a signal
WO2010058931A3 (en) * 2008-11-14 2010-08-05 Lg Electronics Inc. A method and an apparatus for processing a signal

Also Published As

Publication number Publication date
ITTO930244A0 (en) 1993-04-09
IT1264766B1 (en) 1996-10-04
CA2120902A1 (en) 1994-10-10
JPH075899A (en) 1995-01-10
FI941648A0 (en) 1994-04-08
FI941648A (en) 1994-10-10
ITTO930244A1 (en) 1994-10-09

Similar Documents

Publication Publication Date Title
EP0747882B1 (en) Pitch delay modification during frame erasures
US5127053A (en) Low-complexity method for improving the performance of autocorrelation-based pitch detectors
US7260521B1 (en) Method and device for adaptive bandwidth pitch search in coding wideband signals
US5864798A (en) Method and apparatus for adjusting a spectrum shape of a speech signal
JP5519334B2 (en) Open-loop pitch processing for speech coding
EP0422232B1 (en) Voice encoder
US6732070B1 (en) Wideband speech codec using a higher sampling rate in analysis and synthesis filtering than in excitation searching
JP4662673B2 (en) Gain smoothing in wideband speech and audio signal decoders.
US5359696A (en) Digital speech coder having improved sub-sample resolution long-term predictor
EP0747883A2 (en) Voiced/unvoiced classification of speech for use in speech decoding during frame erasures
US20050065785A1 (en) Indexing pulse positions and signs in algebraic codebooks for coding of wideband signals
EP0732686A2 (en) Low-delay code-excited linear-predictive coding of wideband speech at 32kbits/sec
US20040023677A1 (en) Method, device and program for coding and decoding acoustic parameter, and method, device and program for coding and decoding sound
USRE43190E1 (en) Speech coding apparatus and speech decoding apparatus
US5884251A (en) Voice coding and decoding method and device therefor
EP0450064B1 (en) Digital speech coder having improved sub-sample resolution long-term predictor
EP0619574A1 (en) Speech coder employing analysis-by-synthesis techniques with a pulse excitation
JP3357795B2 (en) Voice coding method and apparatus
EP0747884B1 (en) Codebook gain attenuation during frame erasures
US5692101A (en) Speech coding method and apparatus using mean squared error modifier for selected speech coder parameters using VSELP techniques
WO1997031367A1 (en) Multi-stage speech coder with transform coding of prediction residual signals with quantization by auditory models
JP3099852B2 (en) Excitation signal gain quantization method
JP3192051B2 (en) Audio coding device
JP3270146B2 (en) Audio coding device
Zinser et al. 4800 and 7200 bit/sec hybrid codebook multipulse coding

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH DE ES FR GB GR IT LI NL SE

17P Request for examination filed

Effective date: 19940929

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19961101