EP1330818A1 - Method and system for speech frame error concealment in speech decoding - Google Patents

Method and system for speech frame error concealment in speech decoding

Info

Publication number
EP1330818A1
Authority
EP
European Patent Office
Prior art keywords
long term prediction
speech
value
lag
Legal status
Granted
Application number
EP01983716A
Other languages
German (de)
French (fr)
Other versions
EP1330818B1 (en)
Inventor
Jari MÄKINEN
Hannu J. Mikkola
Janne Vainio
Jani Rotola-Pukkila
Current Assignee
Nokia Oyj
Original Assignee
Nokia Oyj
Application filed by Nokia Oyj
Publication of EP1330818A1
Application granted
Publication of EP1330818B1
Anticipated expiration
Expired - Lifetime

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Definitions

  • The present invention relates generally to the decoding of speech signals from an encoded bit stream and, more particularly, to the concealment of corrupted speech parameters when errors in speech frames are detected during speech decoding.
  • Speech and audio coding algorithms have a wide variety of applications in communication, multimedia and storage systems.
  • The development of coding algorithms is driven by the need to save transmission and storage capacity while maintaining high quality in the synthesized signal.
  • The complexity of the coder is limited by, for example, the processing power of the application platform.
  • The encoder may be highly complex, while the decoder should be as simple as possible.
  • Modern speech codecs operate by processing the speech signal in short segments called frames.
  • A typical frame length of a speech codec is 20 ms, which corresponds to 160 speech samples at an 8 kHz sampling frequency. In wideband codecs, the same 20 ms frame corresponds to 320 speech samples at a 16 kHz sampling frequency.
  • The frame may be further divided into a number of sub-frames.
  • The encoder determines a parametric representation of the input signal. The parameters are quantized and transmitted through a communication channel (or stored in a storage medium) in digital form. The decoder produces a synthesized speech signal based on the received parameters, as shown in Figure 1.
  • A typical set of extracted coding parameters includes spectral parameters (such as Linear Predictive Coding (LPC) parameters) used in short-term prediction of the signal, parameters used for long-term prediction (LTP) of the signal, various gain parameters, and excitation parameters.
  • LPC Linear Predictive Coding
  • LTP Long-Term Prediction
  • The LTP parameter is closely related to the fundamental frequency of the speech signal. This parameter is often known as the pitch-lag parameter, which describes the fundamental periodicity in terms of speech samples.
  • One of the gain parameters is closely related to the fundamental periodicity and is therefore called the LTP gain.
  • The LTP gain is a very important parameter in making the speech sound as natural as possible.
  • Speech parameters are transmitted through a communication channel in digital form. Sometimes the condition of the communication channel changes, which can cause errors in the bit stream. This causes frame errors (bad frames), i.e., some of the parameters describing a particular speech segment (typically 20 ms) are corrupted. There are two kinds of frame errors: totally corrupted frames and partially corrupted frames. Totally corrupted frames are sometimes not received by the decoder at all.
  • CELP Code-Excited Linear Prediction
  • A partially corrupted frame is a frame that does arrive at the receiver and may still contain some parameters that are not in error. This is usually the situation in a circuit-switched connection, such as an existing GSM connection.
  • The bit-error rate (BER) in partially corrupted frames is typically around 0.5-5%.
  • Lost or erroneous speech frames are consequences of the bad condition of the communication channel, which causes errors in the bit stream.
  • When a bad frame is detected, an error correction procedure is started.
  • This error correction procedure usually includes a substitution procedure and a muting procedure.
  • In the substitution procedure, the speech parameters of the bad frame are replaced by attenuated or modified values from the previous good frame.
  • Some parameters, such as excitation in CELP parameters
  • Figure 2 shows the principle of the prior-art method. As shown in Figure 2, a buffer labeled "parameter history" is used to store the speech parameters of the last good frame.
  • When a bad frame is detected, the Bad Frame Indicator (BFI) is set to 1 and the error concealment procedure is started.
  • When the frame is not corrupted, the parameter history is updated and the speech parameters are used for decoding without error concealment.
  • The error concealment procedure uses the parameter history for concealing the lost or erroneous parameters in the corrupted frames.
  • AMR GSM Adaptive Multi-Rate
  • In ETSI specification 06.91, the excitation vector from the channel is always used.
  • In some cases, the speech frames are totally lost (e.g., in some IP-based transmission systems).
  • In that case, no parameters from the received bad frame are used. In some cases, no frame is received at all, or the frame arrives so late that it has to be classified as a lost frame.
  • In the prior art, LTP-lag concealment uses the last good LTP-lag value with a slightly modified fractional part, and spectral parameters are replaced by the last good parameters slightly shifted towards a constant mean.
  • The gains are usually replaced by the attenuated last good value or by the median of several last good values.
  • The same substituted speech parameters are used for all sub-frames, with slight modifications to some of them.
  • The prior-art LTP concealment may be adequate for stationary speech signals, for example, voiced speech. However, for non-stationary speech signals, the prior-art method may cause unpleasant, audible artifacts.
  • The present invention takes advantage of the fact that there is a recognizable relationship among the long-term prediction (LTP) parameters in speech signals.
  • In particular, the LTP-lag has a strong correlation with the LTP-gain.
  • When the LTP-gain is high and reasonably stable, the LTP-lag is typically very stable and the variation between adjacent lag values is small.
  • In that case, the speech parameters are indicative of a voiced-speech sequence.
  • When the LTP-gain is low or unstable, the LTP-lag is typically unstable as well, and the speech parameters are indicative of an unvoiced-speech sequence.
  • The first aspect of the present invention is a method for concealing errors in an encoded bit stream indicative of speech signals received in a speech decoder, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences, and wherein the corrupted frame can be partially corrupted or totally corrupted.
  • The method comprises the steps of: determining whether the first long-term prediction lag value is within or outside an upper limit and a lower limit determined based on the second long-term prediction lag values; replacing the first long-term prediction lag value in the partially corrupted frame with a third lag value when the first long-term prediction lag value is outside the upper and lower limits; and retaining the first long-term prediction lag value in the partially corrupted frame when the first long-term prediction lag value is within the upper and lower limits.
  • The method comprises the steps of: determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values; when the speech sequence is stationary, replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value; and when the speech sequence is non-stationary, replacing the first long-term prediction lag value in the corrupted frame with a third long-term prediction lag value determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and replacing the first long-term prediction gain value in the corrupted frame with a third long-term prediction gain value determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
  • The third long-term prediction lag value is calculated based at least partially on a weighted median of the second long-term prediction lag values, and the adaptively-limited random lag jitter is a value bounded by limits determined based on the second long-term prediction lag values.
  • The third long-term prediction gain value is calculated based at least partially on a weighted median of the second long-term prediction gain values, and the adaptively-limited random gain jitter is a value bounded by limits determined based on the second long-term prediction gain values.
  • The method comprises the steps of: determining whether the corrupted frame is partially corrupted or totally corrupted; replacing the first long-term prediction lag value in the corrupted frame with a third lag value if the corrupted frame is totally corrupted, wherein, when the speech sequence in which the totally corrupted frame is arranged is stationary, the third lag value is set equal to the last long-term prediction lag value, and when said speech sequence is non-stationary, the third lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter; and replacing the first long-term prediction lag value in the corrupted frame with a fourth lag value if the corrupted frame is partially corrupted, wherein, when the speech sequence in which the partially corrupted frame is arranged is stationary, the fourth lag value is set equal to the last long-term prediction lag value, and when said speech sequence is non-stationary, the fourth lag value is set based on a decoded long-term prediction lag
  • The second aspect of the present invention is a speech signal transmitter and receiver system for encoding speech signals in an encoded bit stream and decoding the encoded bit stream into synthesized speech, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences.
  • The system comprises: a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term prediction gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
  • The third long-term prediction lag value is calculated based at least partially on a weighted median of the second long-term prediction lag values, and the adaptively-limited random lag jitter is a value bounded by limits determined based on the second long-term prediction lag values.
  • The third long-term prediction gain value is calculated based at least partially on a weighted median of the second long-term prediction gain values, and the adaptively-limited random gain jitter is a value bounded by limits determined based on the second long-term prediction gain values.
  • The third aspect of the present invention is a decoder for synthesizing speech from an encoded bit stream, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences.
  • The decoder comprises: a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term prediction gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
  • The fourth aspect of the present invention is a mobile station, which is arranged to receive an encoded bit stream containing speech data indicative of speech signals, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences.
  • The mobile station comprises: a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term prediction gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
  • The fifth aspect of the present invention is an element in a telecommunication network, which is arranged to receive an encoded bit stream containing speech data from a mobile station, wherein the speech data includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences.
  • The element comprises: a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term prediction gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
  • Figure 1 is a block diagram illustrating a generic distributed speech codec, wherein the encoded bit stream containing speech data is conveyed from an encoder to a decoder via a communication channel or a storage medium.
  • Figure 2 is a block diagram illustrating a prior-art error concealment apparatus in a receiver.
  • Figure 3 is a block diagram illustrating the error concealment apparatus in a receiver, according to the present invention.
  • Figure 4 is a flow chart illustrating the method of error concealment according to the present invention.
  • Figure 5 is a diagrammatic representation of a mobile station, which includes an error concealment module, according to the present invention.
  • Figure 6 is a diagrammatic representation of a telecommunication network using a decoder, according to the present invention.
  • Figure 7 is a plot of LTP-parameters illustrating the lag and gain profiles in a voiced speech sequence.
  • Figure 8 is a plot of LTP-parameters illustrating the lag and gain profiles in an unvoiced speech sequence.
  • Figure 9 is a plot of LTP-lag values in a series of sub-frames illustrating the difference between the prior-art error concealment approach and the approach according to the present invention.
  • Figure 10 is another plot of LTP-lag values in a series of sub-frames illustrating the difference between the prior-art error concealment approach and the approach according to the present invention.
  • Figure 11a is a plot of speech signals illustrating an error-free speech sequence, indicating the location of the bad frame of the speech channel shown in Figures 11b and 11c.
  • Figure 11b is a plot of speech signals illustrating the concealment of parameters in a bad frame according to the prior-art approach.
  • Figure 11c is a plot of speech signals illustrating the concealment of parameters in a bad frame according to the present invention.
  • FIG. 3 illustrates a decoder 10, which includes a decoding module 20 and an error concealment module 30.
  • The decoding module 20 receives a signal 140, which is normally indicative of speech parameters 102 for speech synthesis.
  • The decoding module 20 is known in the art.
  • The error concealment module 30 is arranged to receive an encoded bit stream 100, which includes a plurality of speech frames arranged in speech sequences.
  • A bad-frame detection device 32 is used to detect corrupted frames in the speech sequences and to provide a Bad-Frame-Indicator (BFI) signal 110 representing a BFI flag when a corrupted frame is detected.
  • BFI is also known in the art.
  • The BFI signal 110 is used to control two switches 40 and 42.
  • When no corrupted frame is detected (the BFI flag is 0), the terminal S is operatively connected to the terminal 0 in the switches 40 and 42.
  • The speech parameters 102 are then conveyed to a buffer, or "parameter history" storage, 50 and to the decoding module 20 for speech synthesis.
  • When a corrupted frame is detected, the BFI flag is set to 1.
  • The terminal S is then connected to the terminal 1 in the switches 40 and 42. Accordingly, the speech parameters 102 are provided to an analyzer 70, and the speech parameters needed for speech synthesis are provided by a parameter concealment module 60 to the decoding module 20.
  • The speech parameters 102 typically include LPC parameters for short-term prediction, excitation parameters, a long-term prediction (LTP) lag parameter, an LTP gain parameter and other gain parameters.
  • The parameter history storage 50 is used to store the LTP-lag and LTP-gain of a number of non-corrupted speech frames. The contents of the parameter history storage 50 are constantly updated so that the last LTP-gain parameter and the last LTP-lag parameter stored in the storage 50 are those of the last non-corrupted speech frame.
  • When a corrupted frame is received, the BFI flag is set to 1 and the speech parameters 102 of the corrupted frame are conveyed to the analyzer 70 through the switch 40.
  • By comparing the LTP-gain parameter in the corrupted frame with the LTP-gain parameters stored in the storage 50, the analyzer 70 can determine whether the speech sequence is stationary or non-stationary, based on the magnitude and variation of the LTP-gain parameters in neighboring frames.
  • In a stationary sequence, the LTP-gain parameters are high and reasonably stable, the LTP-lag value is stable, and the variation in adjacent LTP-lag values is small, as shown in Figure 7.
  • In a non-stationary sequence, the LTP-gain parameters are low and unstable, and the LTP-lag is also unstable, as shown in Figure 8.
  • The LTP-lag values then change more or less randomly.
  • Figure 7 shows the speech sequence for the word "viini".
  • Figure 8 shows the speech sequence for the word "exhibition".
  • If the speech sequence is stationary, the last good LTP-lag is retrieved from the storage 50 and conveyed to the parameter concealment module 60.
  • The retrieved good LTP-lag is used to replace the LTP-lag of the corrupted frame. Because the LTP-lag in a stationary speech sequence is stable and its variation is small, it is reasonable to use a previous LTP-lag with small modification to conceal the corresponding parameter in the corrupted frame. Subsequently, an RX signal 104 causes the replacement parameters, denoted by reference numeral 134, to be conveyed to the decoding module 20 through the switch 42.
  • If the speech sequence is non-stationary, the analyzer 70 calculates a replacement LTP-lag value and a replacement LTP-gain value for parameter concealment. Because the LTP-lag in a non-stationary speech sequence is unstable and its variation in adjacent frames is typically very large, parameter concealment should allow the LTP-lag in an error-concealed non-stationary sequence to fluctuate in a random fashion. If the parameters in the corrupted frame are totally corrupted, such as in a lost frame, the replacement LTP-lag is calculated by using a weighted median of the previous good LTP-lag values along with an adaptively-limited random jitter. The adaptively-limited random jitter is allowed to vary within limits calculated from the history of the LTP values, so that the parameter fluctuation in an error-concealed segment is similar to that of the previous good section of the same speech sequence.
  • An exemplary rule for LTP-lag concealment is governed by a set of conditions as follows: if minGain > 0.5 AND LagDif < 10, OR lastGain > 0.5 AND secondLastGain > 0.5, then the last received good LTP-lag is used for the totally corrupted frame; otherwise, Update_lag, a weighted average of the LTP-lag buffer with randomization, is used for the totally corrupted frame. Update_lag is calculated as described below:
  • The LTP-lag buffer is sorted and the three biggest buffer values are retrieved.
  • The average of these three biggest values is referred to as the weighted average lag (WAL), and the difference computed from these three biggest values is referred to as the weighted lag difference (WLD).
  • WAL weighted average lag
  • WLD weighted lag difference
  • Update_lag = WAL + RAND(-WLD/2, WLD/2), where RAND(a, b) denotes a random value between a and b; minGain is the smallest value of the LTP-gain buffer; LagDif is the difference between the smallest and the largest LTP-lag values; lastGain is the last received good LTP-gain; and secondLastGain is the second last received good LTP-gain.
  • The LTP-lag value in the corrupted frame is replaced accordingly. Whether the frame is partially corrupted is determined by a set of exemplary LTP-feature criteria given below:
  • Tbf is the decoded LTP-lag, which is searched from the adaptive codebook when the BFI is set, as if the BFI were not set.
  • Two examples of parameter concealment are shown in Figures 9 and 10.
  • The profile of the replacement LTP-lag values in the bad frame according to the prior art is rather flat, whereas the profile of the replacement according to the present invention allows some fluctuation, similar to the error-free profile.
  • The difference between the prior-art approach and the present invention is further illustrated in Figures 11b and 11c, respectively, based on the speech signals in an error-free channel, as shown in Figure 11a.
  • The parameter concealment can be further optimized.
  • The LTP-lags in the corrupted frames may still yield an acceptable synthesized speech segment.
  • The BFI flag is set by a Cyclic Redundancy Check (CRC) mechanism or other error detection mechanisms.
  • CRC Cyclic Redundancy Check
  • The BER per frame is a good indicator of the channel condition.
  • When the channel condition is good, the BER per frame is small and a high percentage of the LTP-lag values in the erroneous frames are correct. For example, when the frame error rate (FER) is 0.2%, over 70% of the LTP-lag values are correct. Even when the FER reaches 3%, about 60% of the LTP-lag values are still correct.
  • The CRC can accurately detect a bad frame and set the BFI flag accordingly. However, the CRC does not provide an estimate of the BER in the frame.
  • If the BFI flag is used as the only criterion for parameter concealment, a high percentage of correct LTP-lag values could be wasted. In order to prevent a large number of correct LTP-lags from being thrown away, it is possible to adopt a decision criterion for parameter concealment based on the LTP history. It is also possible to use the FER, for example, as the decision criterion. If the LTP-lag meets the decision criterion, no parameter concealment is necessary. In that case, the analyzer 70 conveys the speech parameters 102, as received through the switch 40, to the parameter concealment module 60, which then conveys them to the decoding module 20 through the switch 42. If the LTP-lag does not meet the decision criterion, the corrupted frame is further examined for parameter concealment using the LTP-feature criteria described hereinabove.
  • In stationary speech sequences, the LTP-lag is very stable, so whether most of the LTP-lag values in a corrupted frame are correct or erroneous can be predicted with high probability. Thus, it is possible to adopt a very strict criterion for parameter concealment. In non-stationary speech sequences, it may be difficult to predict whether the LTP-lag value in a corrupted frame is correct, because of the unstable nature of the LTP parameters. However, whether the prediction is correct matters less in non-stationary speech than in stationary speech.
  • the LTP-gain fluctuates greatly in non-stationary speech. If the same LTP-gain value from the last good frame is used repeatedly to replace the LTP-gain value of one or more corrupted frames in a speech sequence, the LTP-gain profile in the gain-concealed segment will be flat (similar to the prior-art LTP-lag replacement, as shown in Figures 7 and 8), in stark contrast to the fluctuating profile of the non-corrupted frames.
  • the sudden change in the LTP-gain profile may cause unpleasant audible artifacts.
  • the analyzer 70 can also be used to determine, based on the gain values in the LTP history, the limits between which the replacement LTP-gain value is allowed to fluctuate.
  • LTP-gain concealment can be carried out in a manner as described below.
  • a replacement LTP-gain value is calculated according to a set of LTP-gain concealment rules.
  • the replacement LTP-gain is denoted as Updated_gain.
  • Updated_gain cannot be larger than lastGain. If the previous conditions cannot be met, the following conditions are used:
  • Figure 4 illustrates the method of error-concealment, according to the present invention.
  • the frame is checked to see if it is corrupted at step 162. If the frame is not corrupted, the parameter history of the speech sequence is updated at step 164, and the speech parameters of the current frame are decoded at step 166. The procedure then returns to step 162. If the frame is bad or corrupted, the parameters are retrieved from the parameter history storage at step 170. Whether the corrupted frame is part of a stationary or a non-stationary speech sequence is determined at step 172. If the speech sequence is stationary, the LTP-lag of the last good frame is used to replace the LTP-lag in the corrupted frame at step 174. If the speech sequence is non-stationary, a new lag value and a new gain value are calculated based on the LTP history at step 180, and they are used to replace the corresponding parameters in the corrupted frame at step 182.
  • FIG. 5 shows a block diagram of a mobile station 200 according to one exemplary embodiment of the invention.
  • the mobile station comprises parts typical of the device, such as a microphone 201, keypad 207, display 206, earphone 214, transmit/receive switch 208, antenna 209 and control unit 205.
  • the figure shows transmitter and receiver blocks 204, 211 typical of a mobile station.
  • the transmitter block 204 comprises a coder 221 for coding the speech signal.
  • the transmitter block 204 also comprises operations required for channel coding, ciphering and modulation as well as RF functions, which have not been drawn in Figure 5 for clarity.
  • the receiver block 211 also comprises a decoding block 220 according to the invention.
  • Decoding block 220 comprises an error concealment module 222 like the error concealment module 30 shown in Figure 3.
  • the signal to be received is taken from the antenna via the transmit/receive switch 208 to the receiver block 211, which demodulates the received signal, deciphers it and decodes the channel coding.
  • the resulting speech signal is taken via the D/A converter 212 to an amplifier 213 and further to an earphone 214.
  • the control unit 205 controls the operation of the mobile station 200, reads the control commands given by the user from the keypad 207 and gives messages to the user by means of the display 206.
  • the error concealment module 30 can also be used in a telecommunication network 300, such as an ordinary telephone network or a mobile station network such as the GSM network.
  • Figure 6 shows an example of a block diagram of such a telecommunication network.
  • the telecommunication network 300 can comprise telephone exchanges or corresponding switching systems 360, to which ordinary telephones 370, base stations 340, base station controllers 350 and other central devices 355 of telecommunication networks are coupled.
  • Mobile stations 330 can establish connection to the telecommunication network via the base stations 340.
  • a decoding block 320, which includes an error concealment module 322 similar to the error concealment module 30 shown in Figure 3, can be particularly advantageously placed in the base station 340, for example.
  • the decoding block 320 can also be placed in the base station controller 350 or other central or switching device 355, for example. If the mobile station system uses separate transcoders, for example, between the base stations and the base station controllers, for transforming the coded signal taken over the radio channel into a typical 64 kbit/s signal transferred in a telecommunication system and vice versa, the decoding block 320 can also be placed in such a transcoder. In general, the decoding block 320, including the error concealment module 322, can be placed in any element of the telecommunication network 300 that transforms the coded data stream into an uncoded data stream. The decoding block 320 decodes and filters the coded speech signal coming from the mobile station 330, whereafter the speech signal can be forwarded uncompressed in the telecommunication network 300 in the usual manner.
  • the error concealment method of the present invention has been described with respect to stationary and non-stationary speech sequences; stationary speech sequences are usually voiced and non-stationary speech sequences are usually unvoiced.
  • the disclosed method is applicable to error concealment in voiced and unvoiced speech sequences.
  • the present invention is applicable to CELP type speech codecs and can be adapted to other types of speech codecs as well.
  • the invention has been described with respect to a preferred embodiment thereof, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the spirit and scope of this invention.
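The Figure 4 flow (steps 162 to 182) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patented implementation: the `Frame` type, the `conceal_sequence` and `is_stationary` helpers and the five-frame history length are hypothetical. The stationarity test borrows the exemplary rule quoted in the Best Mode section (minGain > 0.5 AND LagDif < 10, OR lastGain > 0.5 AND secondLastGain > 0.5), assuming minGain is the smallest stored good gain, LagDif the spread of the stored good lags, and lastGain/secondLastGain the two most recent good gains. For non-stationary speech it substitutes plain medians and omits the adaptively-limited random jitter.

```python
from collections import deque
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence, Tuple

HISTORY_LEN = 5  # hypothetical number of good frames kept in the history


@dataclass
class Frame:
    lag: float        # received LTP-lag (unreliable if corrupted)
    gain: float       # received LTP-gain (unreliable if corrupted)
    corrupted: bool   # BFI flag from the bad-frame detector


def is_stationary(lags: Sequence[float], gains: Sequence[float]) -> bool:
    """Exemplary rule; the meaning of each variable is an assumption."""
    min_gain = min(gains)                     # smallest stored good gain
    lag_dif = max(lags) - min(lags)           # spread of stored good lags
    last_gain, second_last_gain = gains[-1], gains[-2]
    return (min_gain > 0.5 and lag_dif < 10) or (
        last_gain > 0.5 and second_last_gain > 0.5)


def conceal_sequence(frames: List[Frame]) -> List[Tuple[float, float]]:
    """(lag, gain) passed to the decoder for each frame.

    Assumes at least two good frames precede the first corrupted one.
    """
    lags: deque = deque(maxlen=HISTORY_LEN)
    gains: deque = deque(maxlen=HISTORY_LEN)
    decoded = []
    for frame in frames:
        if not frame.corrupted:                       # step 162
            lags.append(frame.lag)                    # step 164
            gains.append(frame.gain)
            decoded.append((frame.lag, frame.gain))   # step 166
        elif is_stationary(lags, gains):              # steps 170-172
            # Step 174 replaces only the lag; reusing the last good gain
            # as well is an assumption of this sketch.
            decoded.append((lags[-1], gains[-1]))
        else:
            # Steps 180-182, simplified to plain medians without jitter.
            decoded.append((median(lags), median(gains)))
    return decoded
```

Only good frames update the history, so concealed values never feed back into later concealment decisions.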

Abstract

A method and system for concealing errors in one or more bad frames in a speech sequence as part of an encoded bit stream received in a decoder. When the speech sequence is voiced, the LTP-parameters in the bad frames are replaced by the corresponding parameters in the last frame. When the speech sequence is unvoiced, the LTP-parameters in the bad frames are replaced by values calculated based on the LTP history along with an adaptively-limited random term.

Description

METHOD AND SYSTEM FOR SPEECH FRAME ERROR CONCEALMENT IN SPEECH DECODING
Field of the Invention
The present invention relates generally to the decoding of speech signals from an encoded bit stream and, more particularly, to the concealment of corrupted speech parameters when errors in speech frames are detected during speech decoding.
Background of the Invention
Speech and audio coding algorithms have a wide variety of applications in communication, multimedia and storage systems. The development of the coding algorithms is driven by the need to save transmission and storage capacity while maintaining the high quality of the synthesized signal. The complexity of the coder is limited by, for example, the processing power of the application platform. In some applications, for example, voice storage, the encoder may be highly complex, while the decoder should be as simple as possible.
Modern speech codecs operate by processing the speech signal in short segments called frames. A typical frame length of a speech codec is 20 ms, which corresponds to 160 speech samples, assuming an 8 kHz sampling frequency. In wideband codecs, the same 20 ms frame corresponds to 320 speech samples, assuming a 16 kHz sampling frequency. The frame may be further divided into a number of sub-frames. For every frame, the encoder determines a parametric representation of the input signal. The parameters are quantized and transmitted through a communication channel (or stored in a storage medium) in a digital form. The decoder produces a synthesized speech signal based on the received parameters, as shown in Figure 1.
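The frame sizes above follow from samples = sampling rate × frame duration. The short Python sketch below computes them and cuts a signal into frames and sub-frames; the function names are hypothetical, and the four-sub-frame split is an illustrative assumption, since the text only says that a frame may be divided into sub-frames.

```python
def frame_length(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Number of samples in one frame: rate (samples/s) x duration (s)."""
    return sample_rate_hz * frame_ms // 1000


def split_frames(signal, sample_rate_hz, frame_ms=20, subframes=4):
    """Cut a signal into whole frames, each divided into equal sub-frames.

    Trailing samples that do not fill a complete frame are dropped.
    subframes=4 is an illustrative assumption, not taken from the text.
    """
    n = frame_length(sample_rate_hz, frame_ms)
    if n % subframes:
        raise ValueError("frame length must divide evenly into sub-frames")
    step = n // subframes
    frames = []
    for start in range(0, len(signal) - n + 1, n):
        frame = signal[start:start + n]
        frames.append([frame[k:k + step] for k in range(0, n, step)])
    return frames


print(frame_length(8000), frame_length(16000))  # 160 320
```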
A typical set of extracted coding parameters includes spectral parameters (such as Linear Predictive Coding (LPC) parameters) to be used in short term prediction of the signal, parameters to be used for long term prediction (LTP) of the signal, various gain parameters, and excitation parameters. The LTP parameter is closely related to the fundamental frequency of the speech signal. This parameter is often known as a so-called pitch-lag parameter, which describes the fundamental periodicity in terms of speech samples. Also, one of the gain parameters is very much related to the fundamental periodicity and so it is called LTP gain. The LTP gain is a very important parameter in making the speech as natural as possible. The description of the coding parameters above fits in general terms with a variety of speech codecs, including the so-called Code-Excited Linear Prediction (CELP) codecs, which have for some time been the most successful speech codecs.
Speech parameters are transmitted through a communication channel in a digital form. Sometimes the condition of the communication channel changes, and that might cause errors in the bit stream. This will cause frame errors (bad frames), i.e., some of the parameters describing a particular speech segment (typically 20 ms) are corrupted. There are two kinds of frame errors: totally corrupted frames and partially corrupted frames. Totally corrupted frames are sometimes not received in the decoder at all. In packet-based transmission systems, as in normal Internet connections, a data packet may never reach the receiver, or it may arrive so late that it cannot be used because of the real-time nature of spoken speech. A partially corrupted frame is a frame that does arrive at the receiver and can still contain some parameters that are not in error. This is usually the situation in a circuit-switched connection, such as the existing GSM connection.
The bit-error rate (BER) in the partially corrupted frames is typically around 0.5-5%.
From the description above, it can be seen that the two cases of bad or corrupted frames will require different approaches in dealing with the degradation in reconstructed speech due to the loss of speech parameters.
The lost or erroneous speech frames are consequences of the bad condition of the communication channel, which causes errors in the bit stream. When an error is detected in the received speech frame, an error correction procedure is started. This error correction procedure usually includes a substitution procedure and a muting procedure. In the prior art, the speech parameters of the bad frame are replaced by attenuated or modified values from the previous good frame. However, some parameters (such as the excitation in CELP parameters) in the corrupted frame may still be used for decoding. Figure 2 shows the principle of the prior-art method. As shown in Figure 2, a buffer labeled "parameter history" is used to store the speech parameters of the last good frame. When a bad frame is detected, the Bad Frame Indicator (BFI) is set to 1 and the error concealment procedure is started. When the BFI is not set (BFI=0), the parameter history is updated and the speech parameters are used for decoding without error concealment. In the prior-art system, the error concealment procedure uses the parameter history for concealing the lost or erroneous parameters in the corrupted frames. Some speech parameters may be used from the received frame even though it is classified as a bad frame (BFI=1). For example, in the GSM Adaptive Multi-Rate (AMR) speech codec (ETSI specification 06.91), the excitation vector from the channel is always used. When the speech frames are totally lost (e.g., in some IP-based transmission systems), no parameters will be used from the received bad frame. In some cases, no frame will be received, or the frame will arrive so late that it has to be classified as a lost frame.
In a prior-art system, LTP-lag concealment uses the last good LTP-lag value with a slightly modified fractional part, and spectral parameters are replaced by the last good parameters slightly shifted towards a constant mean. The gains (LTP and fixed codebook) are usually replaced by the attenuated last good value or by the median of several last good values. The same substituted speech parameters are used for all sub-frames, with slight modification to some of them.
The prior-art LTP concealment may be adequate for stationary speech signals, for example, voiced speech. However, for non-stationary speech signals, the prior-art method may cause unpleasant and audible artifacts. For example, when the speech signal is unvoiced or non-stationary, simply substituting the lag value in the bad frame with the last good lag value has the effect of generating a short voiced-speech segment in the middle of an unvoiced-speech burst (see Figure 10). This effect, known as the "bing" artifact, can be annoying.
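The prior-art substitution described above can be illustrated with the Python sketch below. The function name, the attenuation factor 0.9 and the three-value median window are hypothetical choices, and the fractional-lag modification and spectral shifting are omitted. Running it over a burst of bad frames shows the constant lag that produces the flat profile and, in unvoiced speech, the "bing" artifact.

```python
from statistics import median
from typing import List, Tuple


def prior_art_conceal(good_lags: List[float], good_gains: List[float],
                      n_bad: int, attenuation: float = 0.9,
                      use_median: bool = False) -> List[Tuple[float, float]]:
    """Replacement (lag, gain) for n_bad consecutive bad frames, prior-art style.

    The lag is the last good lag for every bad frame. The gain is either the
    median of the last three good gains, or the last good gain attenuated
    once more for each successive bad frame. Constants are assumptions.
    """
    lag = good_lags[-1]                        # same lag in every bad frame
    if use_median:
        gain = median(good_gains[-3:])         # median of last good gains
        return [(lag, gain)] * n_bad
    gain = good_gains[-1]
    out = []
    for _ in range(n_bad):
        gain *= attenuation                    # progressively attenuated
        out.append((lag, gain))
    return out
```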
It is advantageous and desirable to provide a method and system for error concealment in speech decoding to improve the speech quality.
Summary of the Invention
The present invention takes advantage of the fact that there is a recognizable relationship among the long-term prediction (LTP) parameters in speech signals. In particular, the LTP-lag has a strong correlation with the LTP-gain. When the LTP-gain is high and reasonably stable, the LTP-lag is typically very stable and the variation between adjacent lag values is small. In that case, the speech parameters are indicative of a voiced-speech sequence. When the LTP-gain is low or unstable, the LTP-lag is typically unstable as well, and the speech parameters are indicative of an unvoiced-speech sequence. Once the speech sequence is classified as stationary (voiced) or non-stationary (unvoiced), the corrupted or bad frame in the sequence can be processed differently.
Accordingly, the first aspect of the present invention is a method for concealing errors in an encoded bit stream indicative of speech signals received in a speech decoder, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, and the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences, and wherein the corrupted frame can be partially corrupted or totally corrupted. The method comprises the steps of:
determining whether the first long-term prediction lag value is within or outside an upper limit and a lower limit determined based on the second long-term prediction lag values;
replacing the first long-term prediction lag value in the partially corrupted frame with a third lag value, when the first long-term prediction lag value is outside the upper and lower limits; and
retaining the first long-term prediction lag value in the partially corrupted frame when the first long-term prediction lag value is within the upper and lower limits.
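The limit check of this first method might be sketched in Python as follows. The text does not say how the upper and lower limits are derived from the previous good lags; taking the minimum and maximum of the history widened by a fixed margin is an assumption, as is using the last good lag as the default "third lag value". The helper names `lag_limits` and `check_partial_lag` and the 5-sample margin are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def lag_limits(history: Sequence[float], margin: float = 5.0) -> Tuple[float, float]:
    """Hypothetical limits: min/max of the good-lag history widened by a margin."""
    return min(history) - margin, max(history) + margin


def check_partial_lag(received_lag: float, history: Sequence[float],
                      replacement: Optional[float] = None,
                      margin: float = 5.0) -> float:
    """Keep the received lag of a partially corrupted frame if it is plausible.

    A lag inside the limits is retained; a lag outside them is replaced by
    `replacement`, or by the last good lag when none is given (an assumption).
    """
    lower, upper = lag_limits(history, margin)
    if lower <= received_lag <= upper:
        return received_lag
    return history[-1] if replacement is None else replacement
```

The point of the check is that a correct lag in a partially corrupted frame is kept rather than discarded just because the BFI flag is set.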
Alternatively, the method comprises the steps of:
determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values;
when the speech sequence is stationary, replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value; and
when the speech sequence is non-stationary, replacing the first long-term prediction lag value in the corrupted frame with a third long-term prediction lag value determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and replacing the first long-term prediction gain value in the corrupted frame with a third long-term prediction gain value determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
Preferably, the third long-term prediction lag value is calculated based at least partially on a weighted median of the second long-term prediction lag values, and the adaptively-limited random lag jitter is a value bound by limits determined based on the second long-term prediction lag values.
Preferably, the third long-term prediction gain value is calculated based at least partially on a weighted median of the second long-term prediction gain values, and the adaptively-limited random gain jitter is a value bound by limits determined based on the second long-term prediction gain values.
Alternatively, the method comprises the steps of:
determining whether the corrupted frame is partially corrupted or totally corrupted;
replacing the first long-term prediction lag value in the corrupted frame with a third lag value if the corrupted frame is totally corrupted, wherein, when the speech sequence in which the totally corrupted frame is arranged is stationary, the third lag value is set equal to the last long-term prediction lag value, and when said speech sequence is non-stationary, the third lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter; and
replacing the first long-term prediction lag value in the corrupted frame with a fourth lag value if the corrupted frame is partially corrupted, wherein, when the speech sequence in which the partially corrupted frame is arranged is stationary, the fourth lag value is set equal to the last long-term prediction lag value, and when said speech sequence is non-stationary, the fourth lag value is set based on a decoded long-term prediction lag value searched from an adaptive codebook associated with the non-corrupted frame preceding the corrupted frame.
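The last alternative is essentially a two-way decision table (partial or total corruption, stationary or non-stationary speech). In the Python sketch below, the stationarity flag, the history-based value for totally corrupted frames and the lag decoded from the adaptive-codebook search are passed in as arguments, because computing them is described elsewhere; the `Corruption` enum, `replacement_lag` and its parameter names are hypothetical. The codebook lag is used directly, although the text only says the fourth lag value is set "based on" it.

```python
from enum import Enum


class Corruption(Enum):
    PARTIAL = "partial"
    TOTAL = "total"


def replacement_lag(kind: Corruption, stationary: bool, last_good_lag: float,
                    history_lag: float, codebook_lag: float) -> float:
    """Choose the replacement LTP-lag for a corrupted frame.

    history_lag: value derived from previous good lags plus a bounded random
        jitter (used for totally corrupted frames in non-stationary speech).
    codebook_lag: lag decoded from the adaptive-codebook search associated
        with the preceding good frame (partially corrupted, non-stationary).
    """
    if stationary:
        # Third and fourth lag values both equal the last good lag.
        return last_good_lag
    if kind is Corruption.TOTAL:
        return history_lag
    return codebook_lag
```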
The second aspect of the present invention is a speech signal transmitter and receiver system for encoding speech signals in an encoded bit stream and decoding the encoded bit stream into synthesized speech, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, and the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences. The system comprises:
a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and
a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
Preferably, the third long-term prediction lag value is calculated based at least partially on a weighted median of the second long-term prediction lag values, and the adaptively-limited random lag jitter is a value bound by limits determined based on the second long-term prediction lag values.
Preferably, the third long-term prediction gain value is calculated based at least partially on a weighted median of the second long-term prediction gain values, and the adaptively-limited random gain jitter is a value bound by limits determined based on the second long-term prediction gain values.
The third aspect of the present invention is a decoder for synthesizing speech from an encoded bit stream, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value and the second long-term prediction gain values include a last long-term prediction gain value and the speech sequences include stationary and non-stationary speech sequences. The decoder comprises:
a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and
a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
The fourth aspect of the present invention is a mobile station, which is arranged to receive an encoded bit stream containing speech data indicative of speech signals, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value and the second long-term prediction gain values include a last long-term prediction gain value and the speech sequences include stationary and non-stationary speech sequences. The mobile station comprises:
a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and
a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
The fifth aspect of the present invention is an element in a telecommunication network, which is arranged to receive an encoded bit stream containing speech data from a mobile station, wherein the speech data includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame is indicated by a first signal and includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value and the second long-term prediction gain values include a last long-term prediction gain value and the speech sequences include stationary and non-stationary speech sequences.
The element comprises:
a first mechanism, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, based on the second long-term prediction gain values, and for providing a second signal indicative of whether the speech sequence is stationary or non-stationary; and
a second mechanism, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when the speech sequence is stationary, and replacing the first long-term prediction lag value and the first long-term gain value in the corrupted frame with a third long-term prediction lag value and a third long-term prediction gain value, respectively, when the speech sequence is non-stationary, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter, and the third long-term prediction gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
The present invention will become apparent upon reading the description taken in conjunction with Figures 3 to 11c.
Brief Description of the Drawings
Figure 1 is a block diagram illustrating a generic distributed speech codec, wherein the encoded bit stream containing speech data is conveyed from an encoder to a decoder via a communication channel or a storage medium.
Figure 2 is a block diagram illustrating a prior-art error concealment apparatus in a receiver.
Figure 3 is a block diagram illustrating the error concealment apparatus in a receiver, according to the present invention.
Figure 4 is a flow chart illustrating the method of error concealment according to the present invention.
Figure 5 is a diagrammatic representation of a mobile station, which includes an error concealment module, according to the present invention.
Figure 6 is a diagrammatic representation of a telecommunication network using a decoder, according to the present invention.
Figure 7 is a plot of LTP-parameters illustrating the lag and gain profiles in a voiced speech sequence.
Figure 8 is a plot of LTP-parameters illustrating the lag and gain profiles in an unvoiced speech sequence.
Figure 9 is a plot of LTP-lag values in a series of sub-frames illustrating the difference between the prior-art error concealment approach and the approach according to the present invention.
Figure 10 is another plot of LTP-lag values in a series of sub-frames illustrating the difference between the prior-art error concealment approach and the approach according to the present invention.
Figure 11a is a plot of speech signals illustrating an error-free speech sequence, indicating the location of the bad frame of the speech channel shown in Figures 11b and 11c.
Figure 11b is a plot of speech signals illustrating the concealment of parameters in a bad frame according to the prior-art approach.
Figure 11c is a plot of speech signals illustrating the concealment of parameters in a bad frame according to the present invention.
Best Mode for Carrying Out the Invention
Figure 3 illustrates a decoder 10, which includes a decoding module 20 and an error concealment module 30. The decoding module 20 receives a signal 140, which is normally indicative of speech parameters 102 for speech synthesis. The decoding module 20 is known in the art. The error concealment module 30 is arranged to receive an encoded bit stream 100, which includes a plurality of speech frames arranged in speech sequences. A bad-frame detection device 32 is used to detect corrupted frames in the speech sequences and provide a Bad-Frame-Indicator (BFI) signal 110 representing a BFI flag when a corrupted frame is detected. BFI is also known in the art. The BFI signal 110 is used to control two switches 40 and 42. Normally, the speech frames are not corrupted and the BFI flag is 0. In that case, the terminal S is operatively connected to the terminal 0 in the switches 40 and 42, and the speech parameters 102 are conveyed to a buffer, or "parameter history" storage, 50 and to the decoding module 20 for speech synthesis. When a bad frame is detected by the bad-frame detection device 32, the BFI flag is set to 1 and the terminal S is connected to the terminal 1 in the switches 40 and 42. Accordingly, the speech parameters 102 are provided to an analyzer 70, and the speech parameters needed for speech synthesis are provided by a parameter concealment module 60 to the decoding module 20. The speech parameters 102 typically include LPC parameters for short term prediction, excitation parameters, a long-term prediction (LTP) lag parameter, an LTP gain parameter and other gain parameters. The parameter history storage 50 is used to store the LTP-lag and LTP-gain of a number of non-corrupted speech frames. The contents of the parameter history storage 50 are constantly updated so that the last LTP-gain parameter and the last LTP-lag parameter stored in the storage 50 are those of the last non-corrupted speech frame.
When a corrupted frame in a speech sequence is received in the decoder 10, the BFI flag is set to 1 and the speech parameters 102 of the corrupted frame are conveyed to the analyzer 70 through the switch 40. By comparing the LTP-gain parameter in the corrupted frame with the LTP-gain parameters stored in the storage 50, the analyzer 70 can determine whether the speech sequence is stationary or non-stationary, based on the magnitude and variation of the LTP-gain parameters in neighboring frames. Typically, in a stationary sequence, the LTP-gain parameters are high and reasonably stable, the LTP-lag value is stable and the variation in adjacent LTP-lag values is small, as shown in Figure 7. In contrast, in a non-stationary sequence, the LTP-gain parameters are low and unstable, and the LTP-lag is also unstable, as shown in Figure 8: the LTP-lag values change more or less randomly. Figure 7 shows the speech sequence for the word "viiniä". Figure 8 shows the speech sequence for the word "exhibition".
If the speech sequence that includes the corrupted frame is voiced or stationary, the last good LTP-lag is retrieved from the storage 50 and conveyed to the parameter concealment module 60. The retrieved good LTP-lag is used to replace the LTP-lag of the corrupted frame. Because the LTP-lag in a stationary speech sequence is stable and its variation is small, it is reasonable to use a previous LTP-lag, with small modification, to conceal the corresponding parameter in the corrupted frame. Subsequently, an RX signal 104 causes the replacement parameters, denoted by reference numeral 134, to be conveyed to the decoding module 20 through the switch 42.
If the speech sequence that includes the corrupted frame is unvoiced or non-stationary, the analyzer 70 calculates a replacement LTP-lag value and a replacement LTP-gain value for parameter concealment. Because the LTP-lag in a non-stationary speech sequence is unstable and its variation between adjacent frames is typically very large, parameter concealment should allow the LTP-lag in an error-concealed non-stationary sequence to fluctuate in a random fashion. If the parameters in the corrupted frame are totally corrupted, such as in a lost frame, the replacement LTP-lag is calculated by using a weighted median of the previous good LTP-lag values along with an adaptively-limited random jitter. The adaptively-limited random jitter is allowed to vary within limits calculated from the history of the LTP values, so that the parameter fluctuation in an error-concealed segment is similar to that in the previous good section of the same speech sequence.
An exemplary rule for LTP-lag concealment is governed by the following set of conditions. If
minGain > 0.5 AND LagDif < 10; OR lastGain > 0.5 AND secondLastGain > 0.5,
then the last received good LTP-lag is used for the totally corrupted frame. Otherwise, Update_lag, a weighted average of the LTP-lag buffer with randomization, is used for the totally corrupted frame. Update_lag is calculated as follows:
The LTP-lag buffer is sorted and the three largest buffer values are retrieved. The average of these three largest values is referred to as the weighted average lag (WAL), and the difference among these three largest values is referred to as the weighted lag difference (WLD).
Let RAND(-WLD/2, WLD/2) denote a random value within the range (-WLD/2, WLD/2); then
Update_lag = WAL + RAND(-WLD/2, WLD/2),
where minGain is the smallest value of the LTP-gain buffer; LagDif is the difference between the smallest and the largest LTP-lag values; lastGain is the last received good LTP-gain; and secondLastGain is the second last received good LTP-gain.
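A minimal Python sketch of this rule for a totally corrupted frame follows. It assumes that the LTP-lag and LTP-gain buffers are lists of good-frame values, oldest first, with at least two entries each, and it interprets WLD as the spread (largest minus smallest) of the three largest lag values, which is one reading of the definition above. The function name and the use of a uniform random draw for RAND are illustrative assumptions.

```python
import random
from typing import Optional, Sequence


def conceal_lost_lag(lags: Sequence[float], gains: Sequence[float],
                     rng: Optional[random.Random] = None) -> float:
    """Hypothetical replacement LTP-lag for a totally corrupted frame.

    `lags` and `gains` are the good-frame LTP buffers, oldest first
    (at least two entries each).
    """
    if rng is None:
        rng = random.Random()
    min_gain = min(gains)
    lag_dif = max(lags) - min(lags)            # LagDif
    last_gain, second_last_gain = gains[-1], gains[-2]

    # Stable, strongly voiced history: reuse the last received good lag.
    if (min_gain > 0.5 and lag_dif < 10) or (last_gain > 0.5 and second_last_gain > 0.5):
        return float(lags[-1])

    top3 = sorted(lags)[-3:]                    # the three largest buffer values
    wal = sum(top3) / len(top3)                 # weighted average lag (WAL)
    wld = max(top3) - min(top3)                 # WLD, read here as the spread of top3
    # Update_lag = WAL + RAND(-WLD/2, WLD/2), with RAND drawn uniformly.
    return wal + rng.uniform(-wld / 2, wld / 2)
```

Passing a seeded `random.Random` makes the jitter reproducible in tests. In a real decoder the result would still have to be rounded or quantized to a lag the codec can represent; the text does not specify how.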
If the parameters in the corrupted frame are only partially corrupted, the LTP-lag value in the corrupted frame is replaced accordingly. Whether the frame is partially corrupted is determined by the following exemplary set of LTP-feature criteria:
If
(1) LagDif < 10 AND (minLag - 5) < Tbf < (maxLag + 5); OR
(2) lastGain > 0.5 AND secondLastGain > 0.5 AND (lastLag - 10) < Tbf < (lastLag + 10); OR
(3) minGain < 0.4 AND lastGain = minGain AND minLag < Tbf < maxLag; OR
(4) LagDif < 70 AND minLag < Tbf < maxLag; OR
(5) meanLag < Tbf < maxLag
is true, then Tbf is used to replace the LTP-lag in the corrupted frame. Otherwise, the corrupted frame is treated as a totally corrupted frame, as described above. In the above conditions: maxLag is the largest value of the LTP-lag buffer; meanLag is the average of the LTP-lag buffer; minLag is the smallest value of the LTP-lag buffer; lastLag is the last received good LTP-lag value; and
Tbf is a decoded LTP-lag that is searched from the adaptive codebook, when the BFI is set, as if the BFI were not set.
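The five criteria can be expressed as a single predicate. In this hypothetical sketch, `tbf` stands for the decoded lag Tbf and the buffers hold good-frame values, oldest first, with at least two entries each. When the function returns False, the frame would be handled as totally corrupted, as described above. The exact comparison `last_gain == min_gain` is safe here because both values come from the same buffer.

```python
from statistics import mean
from typing import Sequence


def keep_decoded_lag(tbf: float, lags: Sequence[float], gains: Sequence[float]) -> bool:
    """Hypothetical check: True if Tbf satisfies any of LTP-feature criteria (1)-(5)."""
    min_lag, max_lag, mean_lag = min(lags), max(lags), mean(lags)
    lag_dif = max_lag - min_lag                 # LagDif
    min_gain = min(gains)
    last_gain, second_last_gain = gains[-1], gains[-2]
    last_lag = lags[-1]
    return (
        (lag_dif < 10 and min_lag - 5 < tbf < max_lag + 5)                          # (1)
        or (last_gain > 0.5 and second_last_gain > 0.5
            and last_lag - 10 < tbf < last_lag + 10)                                # (2)
        or (min_gain < 0.4 and last_gain == min_gain and min_lag < tbf < max_lag)   # (3)
        or (lag_dif < 70 and min_lag < tbf < max_lag)                               # (4)
        or (mean_lag < tbf < max_lag)                                               # (5)
    )
```

The criteria are deliberately ordered from strict (a stable lag history) to lax (any lag between the mean and the maximum), so a plausible decoded lag is kept whenever at least one view of the history supports it.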
Two examples of parameter concealment are shown in Figures 9 and 10. As shown, the profile of the replacement LTP-lag values in the bad frame according to the prior art is rather flat, whereas the profile of the replacement according to the present invention allows some fluctuation, similar to the error-free profile. The difference between the prior-art approach and the present invention is further illustrated in Figures 11b and 11c, respectively, based on the speech signals in an error-free channel, as shown in Figure 11a.
When the parameters in the corrupted frame are only partially corrupted, the parameter concealment can be further optimized. In partially corrupted frames, the LTP-lags may still yield an acceptable synthesized speech segment. According to the GSM specifications, the BFI flag is set by a Cyclic Redundancy Check (CRC) mechanism or other error detection mechanisms. These error detection mechanisms detect errors in the most significant bits during the channel decoding process. Consequently, even when only a few bits are erroneous, the error can be detected and the BFI flag is set. In the prior-art parameter concealment approach, the entire frame is discarded. As a result, the information contained in the correct bits is thrown away.
Typically, in the channel decoding process, the bit error rate (BER) per frame is a good indicator of the channel condition. When the channel condition is good, the BER per frame is small and a high percentage of the LTP-lag values in the erroneous frames are correct. For example, when the frame error rate (FER) is 0.2%, over 70% of the LTP-lag values are correct. Even when the FER reaches 3%, about 60% of the LTP-lag values are still correct. The CRC can accurately detect a bad frame and set the BFI flag accordingly. However, the CRC does not provide an estimate of the BER in the frame. If the BFI flag is used as the only criterion for parameter concealment, a high percentage of the correct LTP-lag values could be wasted. In order to prevent a large number of correct LTP-lags from being thrown away, it is possible to adapt a decision criterion for parameter concealment based on the LTP history. It is also possible to use the FER, for example, as the decision criterion. If the LTP-lag meets the decision criterion, no parameter concealment is necessary. In that case, the analyzer 70 conveys the speech parameters 102, as received through the switch 40, to the parameter concealment module 60, which then conveys them to the decoding module 20 through the switch 42. If the LTP-lag does not meet that decision criterion, the corrupted frame is further examined using the LTP-feature criteria described above for parameter concealment.
In stationary speech sequences, the LTP-lag is very stable, so whether most of the LTP-lag values in a corrupted frame are correct or erroneous can be predicted with high probability. Thus, it is possible to adopt a very strict criterion for parameter concealment. In non-stationary speech sequences, it may be difficult to predict whether the LTP-lag value in a corrupted frame is correct, because of the unstable nature of the LTP parameters. However, whether the prediction is right or wrong matters less in non-stationary speech than in stationary speech. While allowing erroneous LTP-lag values to be used in decoding stationary speech may make the synthesized speech unrecognizable, allowing erroneous LTP-lag values to be used in decoding non-stationary speech usually only increases the audible artifacts. Thus, the decision criterion for parameter concealment in non-stationary speech can be relatively lax.
As mentioned earlier, the LTP-gain fluctuates greatly in non-stationary speech. If the same LTP-gain value from the last good frame is used repeatedly to replace the LTP-gain value of one or more corrupted frames in a speech sequence, the LTP-gain profile in the gain-concealed segment will be flat (similar to the prior-art LTP-lag replacement, as shown in Figures 7 and 8), in stark contrast to the fluctuating profile of the non-corrupted frames. The sudden change in the LTP-gain profile may cause unpleasant audible artifacts. To minimize these audible artifacts, the replacement LTP-gain value can be allowed to fluctuate in the error-concealed segment. For this purpose, the analyzer 70 can also be used to determine, based on the gain values in the LTP history, the limits between which the replacement LTP-gain value is allowed to fluctuate.
LTP-gain concealment can be carried out in a manner as described below. When the BFI is set, a replacement LTP-gain value is calculated according to a set of LTP-gain concealment rules. The replacement LTP-gain is denoted as Updated_gain.
(1) If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 1, then Updated_gain = (secondLastGain + thirdLastGain)/2;
(2) If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 2, then Updated_gain = meanGain + randVar * (maxGain - meanGain);
(3) If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 3, then Updated_gain = meanGain - randVar * (meanGain - minGain);
(4) If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 4, then Updated_gain = meanGain + randVar * (maxGain - meanGain).
In the previous conditions, Updated_gain cannot be larger than lastGain. If the previous conditions cannot be met, the following conditions are used:
(5) If gainDif > 0.5, then Updated_gain = lastGain;
(6) If gainDif < 0.5 AND lastGain = maxGain, then Updated_gain = meanGain;
(7) If gainDif < 0.5, then Updated_gain = lastGain,
where meanGain is the average of the LTP-gain buffer; maxGain is the largest value of the LTP-gain buffer; minGain is the smallest value of the LTP-gain buffer; randVar is a random value between 0 and 1; gainDif is the difference between the smallest and the largest LTP-gain values in the LTP-gain buffer; lastGain is the last received good LTP-gain; secondLastGain is the second last received good LTP-gain; thirdLastGain is the third last received good LTP-gain; and subBF is the index of the subframe within the frame.
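The gain rules above can be sketched as one function. The names are hypothetical, and the gain buffer is assumed to hold at least three good-frame values, oldest first, so that lastGain, secondLastGain and thirdLastGain are its last three entries. The statement that Updated_gain cannot be larger than lastGain is read here, as in claim 32, as falling back to rules (5) to (7) whenever rules (1) to (4) would exceed lastGain; the case gainDif = 0.5, which the rules leave undefined, is assumed to keep lastGain.

```python
import random
from statistics import mean
from typing import Optional, Sequence


def conceal_ltp_gain(gains: Sequence[float], sub_bf: int,
                     rand_var: Optional[float] = None) -> float:
    """Hypothetical Updated_gain for subframe `sub_bf` (1..4) of a bad frame.

    `gains` is the good-frame LTP-gain buffer, oldest first (>= 3 entries).
    `rand_var` defaults to a uniform draw from [0, 1).
    """
    if rand_var is None:
        rand_var = random.random()
    mean_gain, max_gain, min_gain = mean(gains), max(gains), min(gains)
    gain_dif = max_gain - min_gain
    last_gain, second_last, third_last = gains[-1], gains[-2], gains[-3]

    # Rules (1)-(4): large gain swing ending on a strong peak (lastGain = maxGain > 0.9).
    if gain_dif > 0.5 and last_gain == max_gain > 0.9 and sub_bf in (1, 2, 3, 4):
        if sub_bf == 1:
            updated = (second_last + third_last) / 2
        elif sub_bf == 3:
            updated = mean_gain - rand_var * (mean_gain - min_gain)
        else:  # sub_bf is 2 or 4
            updated = mean_gain + rand_var * (max_gain - mean_gain)
        if updated <= last_gain:  # Updated_gain may not exceed lastGain
            return updated

    # Rules (5)-(7).
    if gain_dif < 0.5 and last_gain == max_gain:
        return mean_gain          # rule (6)
    return last_gain              # rules (5) and (7); gainDif == 0.5 also lands here
```

Letting the caller supply `rand_var` keeps the rules deterministic under test while still producing the fluctuating gain profile the text aims for when it is drawn randomly per subframe.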
Figure 4 illustrates the method of error concealment according to the present invention. As the encoded bit stream is received at step 160, each frame is checked at step 162 to see whether it is corrupted. If the frame is not corrupted, the parameter history of the speech sequence is updated at step 164, and the speech parameters of the current frame are decoded at step 166. The procedure then returns to step 162. If the frame is bad or corrupted, the parameters are retrieved from the parameter history storage at step 170. Whether the corrupted frame is part of a stationary or a non-stationary speech sequence is determined at step 172. If the speech sequence is stationary, the LTP-lag of the last good frame is used to replace the LTP-lag in the corrupted frame at step 174. If the speech sequence is non-stationary, a new lag value and a new gain value are calculated based on the LTP history at step 180, and they are used to replace the corresponding parameters in the corrupted frame at step 182.
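The control flow of Figure 4 can be sketched as a loop over frames. The stationarity test and the lag and gain estimators are passed in as functions because the text describes them separately; their names, the frame tuple layout and the history depth are hypothetical. The text does not describe a corrupted frame that arrives before any good frame, so this sketch assumes such a frame is passed through unchanged.

```python
from collections import deque
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical frame layout: (corrupted, ltp_lag, ltp_gain).
Frame = Tuple[bool, float, float]
Test = Callable[[Sequence[float], Sequence[float]], bool]
Estimator = Callable[[Sequence[float], Sequence[float]], float]


def conceal_stream(frames: Iterable[Frame], is_stationary: Test,
                   new_lag: Estimator, new_gain: Estimator,
                   depth: int = 5) -> List[Tuple[float, float]]:
    """Walk the steps of Figure 4 and return the (lag, gain) pair used for each frame."""
    lags: deque = deque(maxlen=depth)    # parameter history (depth is an assumption)
    gains: deque = deque(maxlen=depth)
    out: List[Tuple[float, float]] = []
    for corrupted, lag, gain in frames:           # step 160: receive the bit stream
        if not corrupted:                         # step 162: frame is good
            lags.append(lag)                      # step 164: update the parameter history
            gains.append(gain)
            out.append((lag, gain))               # step 166: decode as received
            continue
        if not lags:                              # assumption: no history yet, pass through
            out.append((lag, gain))
            continue
        # Step 170: the history is read directly from `lags` and `gains`.
        if is_stationary(lags, gains):            # step 172
            out.append((lags[-1], gain))          # step 174: only the lag is replaced
        else:                                     # steps 180 and 182
            out.append((new_lag(lags, gains), new_gain(lags, gains)))
    return out
```

In a fuller implementation, `new_lag` and `new_gain` would implement the Update_lag and Updated_gain rules given earlier in this section, and `is_stationary` would encode the analyzer 70's decision.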
Figure 5 shows a block diagram of a mobile station 200 according to one exemplary embodiment of the invention. The mobile station comprises parts typical of the device, such as a microphone 201, keypad 207, display 206, earphone 214, transmit/receive switch 208, antenna 209 and control unit 205. In addition, the figure shows transmitter and receiver blocks 204, 211 typical of a mobile station. The transmitter block 204 comprises a coder 221 for coding the speech signal. The transmitter block 204 also comprises operations required for channel coding, ciphering and modulation as well as RF functions, which have not been drawn in Figure 5 for clarity. The receiver block 211 also comprises a decoding block 220 according to the invention. Decoding block 220 comprises an error concealment module 222 like the error concealment module 30 shown in Figure 3. The signal coming from the microphone 201, amplified at the amplification stage 202 and digitized in the A/D converter, is taken to the transmitter block 204, typically to the speech coding device comprised by the transmit block. The transmission signal, which is processed, modulated and amplified by the transmit block, is taken via the transmit/receive switch 208 to the antenna 209. The signal to be received is taken from the antenna via the transmit/receive switch 208 to the receiver block 211, which demodulates the received signal and removes the ciphering and the channel coding. The resulting speech signal is taken via the D/A converter 212 to an amplifier 213 and further to an earphone 214. The control unit 205 controls the operation of the mobile station 200, reads the control commands given by the user from the keypad 207 and gives messages to the user by means of the display 206.
The error concealment module 30 according to the invention can also be used in a telecommunication network 300, such as an ordinary telephone network or a mobile station network such as the GSM network. Figure 6 shows an example of a block diagram of such a telecommunication network. For example, the telecommunication network 300 can comprise telephone exchanges or corresponding switching systems 360, to which ordinary telephones 370, base stations 340, base station controllers 350 and other central devices 355 of telecommunication networks are coupled. Mobile stations 330 can establish a connection to the telecommunication network via the base stations 340. A decoding block 320, which includes an error concealment module 322 similar to the error concealment module 30 shown in Figure 3, can be particularly advantageously placed in the base station 340, for example. However, the decoding block 320 can also be placed in the base station controller 350 or in another central or switching device 355, for example. If the mobile station system uses separate transcoders, for example between the base stations and the base station controllers, for transforming the coded signal taken over the radio channel into a typical 64 kbit/s signal transferred in a telecommunication system and vice versa, the decoding block 320 can also be placed in such a transcoder. In general, the decoding block 320, including the error concealment module 322, can be placed in any element of the telecommunication network 300 that transforms the coded data stream into an uncoded data stream. The decoding block 320 decodes and filters the coded speech signal coming from the mobile station 330, after which the speech signal can be forwarded uncompressed in the usual manner in the telecommunication network 300.
It should be noted that the error concealment method of the present invention has been described with respect to stationary and non-stationary speech sequences, and that stationary speech sequences are usually voiced and non-stationary speech sequences are usually unvoiced. Thus, it will be understood that the disclosed method is applicable to error concealment in voiced and unvoiced speech sequences. The present invention is applicable to CELP type speech codecs and can be adapted to other types of speech codecs as well. Thus, although the invention has been described with respect to a preferred embodiment thereof, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the spirit and scope of this invention.

Claims

What is claimed is:
1. A method for concealing errors in an encoded bit stream indicative of speech signals received in a speech decoder, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one
partially corrupted frame preceded by one or more non-corrupted frames, wherein the partially corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, and the second long-term prediction gain values include a last long-term prediction gain value, said method comprising the steps of: providing an upper limit and a lower limit based on the second long-term prediction lag values; determining whether the first long-term prediction lag value is within or outside the upper and lower limits; replacing the first long-term prediction lag value in the partially corrupted frame with a third lag value, when the first long-term prediction lag value is outside the upper and lower limits; and retaining the first long-term prediction lag value in the partially corrupted frame when the first long-term prediction lag value is within the upper and lower limits.
2. The method of claim 1, further comprising the step of replacing the first long-term prediction gain value in the partially corrupted frame with a third gain value, when the first long-term lag value is outside the upper and lower limits.
3. The method of claim 1, wherein the third lag value is calculated based on the second long-term prediction lag values and an adaptively-limited random lag jitter bound by further limits determined based on the second long-term prediction lag values.
4. The method of claim 2, wherein the third gain value is calculated based on the second long-term prediction gain values and an adaptively-limited random gain jitter bound by limits determined based on the second long-term prediction gain values.
5. A method for concealing errors in an encoded bit stream indicative of speech signals received in a speech decoder, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value, and the second long-term prediction gain values include a last long-term prediction gain value and the speech sequences include stationary and non-stationary speech sequences, and wherein the corrupted frame can be a totally corrupted frame or a partially corrupted frame, said method comprising the steps of: determining whether the corrupted frame is partially corrupted or totally corrupted; replacing the first long-term prediction lag value in the corrupted frame with a third lag value if the corrupted frame is totally corrupted; and replacing the first long-term prediction lag value in the corrupted frame with a fourth lag value if the corrupted frame is partially corrupted.
6. The method of claim 5, further comprising the steps of: determining whether the speech sequence in which the partially corrupted frame is arranged is stationary or non-stationary; setting the fourth lag value equal to the last long-term prediction lag value, when said speech sequence is stationary; and determining the fourth lag value based on a decoded long-term prediction lag value searched from an adaptive codebook associated with the non-corrupted frame preceding the corrupted frame, when said speech sequence is non-stationary.
7. The method of claim 5, further comprising the steps of: determining whether the speech sequence in which the totally corrupted frame is arranged is stationary or non-stationary; setting the third lag value equal to the last long-term prediction lag value, when said speech sequence is stationary; and determining the third lag value based on the second long-term prediction values and an adaptively-limited random lag jitter, when said speech sequence is non-stationary.
8. The method of claim 6, wherein the second long-term prediction lag values further include a second last long-term prediction lag value and a third last long-term prediction lag value, and the second long-term prediction gain values further include a second last long-term prediction gain value and a third last long-term prediction gain value, said method further comprising the steps of: determining minLag, which is the smallest lag value among the second long-term prediction lag values; determining maxLag, which is the largest lag value among the second long-term prediction lag values; determining meanLag, which is an average of the second long-term prediction lag values; determining difLag, which is the difference of maxLag and minLag; determining minGain, which is the smallest gain value among the second long-term prediction gain values; determining maxGain, which is the largest gain value among the second long-term prediction gain values; and determining meanGain, which is an average of the second long-term gain values; wherein if difLag < 10, and (minLag - 5) < the fourth lag value < (maxLag + 5); or if the last long-term prediction gain value is larger than 0.5, and the second last long-term prediction gain value is larger than 0.5, and the fourth lag value is smaller than a sum of the last long-term prediction value and 10, and a sum of the fourth lag value and 10 is larger than the last long-term prediction value; or if minGain < 0.4, and the last long-term prediction gain value is equal to minGain, and the fourth lag value is larger than minLag but smaller than maxLag; or if difLag < 70, and the fourth lag value is larger than minLag but smaller than maxLag; or if the fourth lag value is larger than meanLag but smaller than maxLag; then the corrupted frame is determined as partially corrupted.
9. The method of claim 6, wherein when said speech sequence is non-stationary, said method further comprises the step of determining a frame-error rate of the speech frames such that if the frame-error rate reaches a determined value, the fourth lag value is determined based on said decoded long-term prediction lag value, and if the frame-error rate is smaller than the determined value, the fourth lag value is set equal to the last long-term prediction lag value.
10. The method of claim 5, wherein the stationary speech sequences include voiced sequences, and the non-stationary speech sequences include unvoiced sequences.
11. A speech signal transmitter and receiver system for encoding speech signals in an encoded bit stream and decoding the encoded bit stream into synthesized speech, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value and the second long-term prediction gain values include a last long-term prediction gain value, and the speech sequences include stationary and non-stationary speech sequences, and a first signal is used to indicate the corrupted frame, said system comprising: a first means, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, and for providing a second signal indicative of said determining; a second means, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when said speech sequence is stationary, and replacing the first long-term prediction lag value in the corrupted frame with a third lag value when said speech sequence is non-stationary.
12. The system of claim 11, wherein the third lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter.
13. The system of claim 11, wherein the second means further replaces the first long-term prediction gain value in the corrupted frame with a third gain value when said speech sequence is non-stationary.
14. The system of claim 13, wherein the third gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
15. The system of claim 11, wherein the stationary speech sequences include voiced sequences, and the non-stationary speech sequences include unvoiced sequences.
16. A decoder for synthesizing speech from an encoded bit stream, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value and the second long-term prediction gain values include a last long-term prediction gain value and the speech sequences include stationary and non-stationary speech sequences, and a first signal is used to indicate the corrupted frame, said decoder comprising: a first means, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, and for providing a second signal indicative of said determining; a second means, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when said speech sequence is stationary, and replacing the first long-term prediction lag value in the corrupted frame with a third lag value when said speech sequence is non-stationary.
17. The decoder of claim 16, wherein the lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter.
18. The decoder of claim 16, wherein the second means further replaces the first long-term gain value in the corrupted frame with a third gain value when said speech sequence is non-stationary.
19. The decoder of claim 18, wherein the third gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
20. The decoder of claim 16, wherein the stationary speech sequences include voiced sequences, and the non-stationary speech sequences include unvoiced sequences.
21. A mobile station, which is arranged to receive an encoded bit stream containing speech data indicative of speech signals, wherein the encoded bit stream includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value and the second long-term prediction gain values include a last long-term prediction gain value and the speech sequences include stationary and non-stationary speech sequences, and wherein a first signal is used to indicate the corrupted frame, said mobile station comprising: a first means, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, and for providing a second signal indicative of said determining; and a second means, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when said speech sequence is stationary, and replacing the first long-term prediction lag value in the corrupted frame with a third lag value when said speech sequence is non-stationary.
22. The mobile station of claim 21, wherein the third lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter.
23. The mobile station of claim 21, wherein the second means further replaces the first long-term gain value in the corrupted frame with a third gain value when said speech sequence is non-stationary.
24. The mobile station of claim 23, wherein the third gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
25. The mobile station of claim 21, wherein the stationary speech sequences include voiced sequences, and the non-stationary speech sequences include unvoiced sequences.
26. An element in a telecommunication network, which is arranged to receive an encoded bit stream containing speech data from a mobile station, wherein the speech data includes a plurality of speech frames arranged in speech sequences, and the speech frames include at least one corrupted frame preceded by one or more non-corrupted frames, wherein the corrupted frame includes a first long-term prediction lag value and a first long-term prediction gain value, and the non-corrupted frames include second long-term prediction lag values and second long-term prediction gain values, and wherein the second long-term prediction lag values include a last long-term prediction lag value and the second long-term prediction gain values include a last long-term prediction gain value and the speech sequences include stationary and non-stationary speech sequences, and wherein a first signal is used to indicate the corrupted frame, said element comprising: a first means, responsive to the first signal, for determining whether the speech sequence in which the corrupted frame is arranged is stationary or non-stationary, and for providing a second signal indicative of said determining; and a second means, responsive to the second signal, for replacing the first long-term prediction lag value in the corrupted frame with the last long-term prediction lag value when said speech sequence is stationary, and replacing the first long-term prediction lag value in the corrupted frame with a third lag value when said speech sequence is non-stationary.
27. The element of claim 26, wherein the third long-term prediction lag value is determined based on the second long-term prediction lag values and an adaptively-limited random lag jitter.
28. The element of claim 26, wherein the second means further replaces the first long-term prediction gain value with a third gain value when said speech sequence is non-stationary.
29. The element of claim 28, wherein the third gain value is determined based on the second long-term prediction gain values and an adaptively-limited random gain jitter.
30. The element of claim 26, wherein the stationary speech sequences include voiced sequences, and the non-stationary speech sequences include unvoiced sequences.
31. (New) The method of claim 5, wherein the second long-term prediction gain values further include a second last long-term prediction gain value, and
if difLag < 10, and (minLag - 5) < decodedLag < (maxLag + 5); or
if lastGain > 0.5, and secondlastGain > 0.5, and (lastLag - 10) < decodedLag < (lastLag + 10); or
if minGain < 0.4, and lastGain > 0.5, and minLag < decodedLag < maxLag; or
if difLag < 70, and minLag < decodedLag < maxLag; or
if meanLag < decodedLag < maxLag,
then the fourth lag value is set equal to decodedLag, wherein minLag is a smallest lag value among the second long-term prediction lag values, maxLag is a largest lag value among the second long-term prediction lag values, meanLag is an average of the second long-term prediction lag values, difLag is a difference of maxLag and minLag, minGain is a smallest gain value among the second long-term prediction gain values, meanGain is an average of the second long-term prediction gain values, lastGain is the last long-term prediction gain value, lastLag is the last long-term prediction lag value, secondlastGain is the second last long-term prediction gain value, and decodedLag is a decoded long-term prediction lag which is searched from an adaptive codebook associated with the non-corrupted frame preceding the corrupted frame.
32. (New) The method of claim 8, wherein the first long-term prediction gain value is replaced by Updated_gain, and wherein
If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 1, then Updated_gain = (secondLastGain + thirdLastGain)/2;
If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 2, then Updated_gain = meanGain + randVar*(maxGain - meanGain);
If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 3, then Updated_gain = meanGain - randVar*(meanGain - minGain);
If gainDif > 0.5 AND lastGain = maxGain > 0.9 AND subBF = 4, then Updated_gain = meanGain + randVar*(maxGain - meanGain);
and when Updated_gain is equal to or smaller than lastGain;
or
If gainDif > 0.5, then Updated_gain = lastGain;
If gainDif < 0.5 AND lastGain = maxGain, then Updated_gain = meanGain;
If gainDif < 0.5, then Updated_gain = lastGain;
and when Updated_gain is larger than lastGain,
wherein randVar is a random value between 0 and 1; gainDif is the difference between the smallest and the largest long-term prediction gain values; lastGain is the last long-term prediction gain value; secondLastGain is the second last long-term prediction gain value; thirdLastGain is the third last long-term prediction gain value; and subBF is the order of the subframe.
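As an illustration only, the lag acceptance test of claim 31 can be sketched in Python. This is a minimal sketch under stated assumptions and not the patented implementation. The function names are illustrative. The lag and gain histories are assumed to be ordered oldest first. The `choose_lag` wrapper and its `fallback_lag` parameter are hypothetical: claim 31 only states when the fourth value is set to decodedLag, not what replaces it otherwise.

```python
from statistics import mean


def accept_decoded_lag(decoded_lag, lags, gains):
    """Return True if decoded_lag passes any of the claim 31 tests.

    lags, gains: LTP lag and gain history from preceding good frames,
    assumed ordered oldest first (lags[-1] is lastLag, gains[-2] is
    secondlastGain). At least two gains are required.
    """
    min_lag, max_lag = min(lags), max(lags)
    mean_lag = mean(lags)
    dif_lag = max_lag - min_lag
    min_gain = min(gains)
    last_lag = lags[-1]
    last_gain, second_last_gain = gains[-1], gains[-2]

    return (
        # Narrow lag spread: accept within 5 of the observed lag range.
        (dif_lag < 10 and min_lag - 5 < decoded_lag < max_lag + 5)
        # Last two gains above 0.5: accept within 10 of lastLag.
        or (last_gain > 0.5 and second_last_gain > 0.5
            and last_lag - 10 < decoded_lag < last_lag + 10)
        or (min_gain < 0.4 and last_gain > 0.5
            and min_lag < decoded_lag < max_lag)
        or (dif_lag < 70 and min_lag < decoded_lag < max_lag)
        or (mean_lag < decoded_lag < max_lag)
    )


def choose_lag(decoded_lag, lags, gains, fallback_lag):
    """Hypothetical wrapper: keep decodedLag if accepted, else fallback_lag.

    The fallback is an assumption; claim 31 does not define it.
    """
    if accept_decoded_lag(decoded_lag, lags, gains):
        return decoded_lag
    return fallback_lag


if __name__ == "__main__":
    lags = [40, 41, 42, 43, 44]
    gains = [0.6, 0.7, 0.8, 0.9, 0.9]
    print(choose_lag(45, lags, gains, fallback_lag=44))  # 45
    print(choose_lag(80, lags, gains, fallback_lag=44))  # 44
```

The five conditions are joined with `or`, so any single one is enough to keep decodedLag, as the claim's "or if" chain specifies.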
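A minimal sketch of the claim 32 gain rules follows, again in Python and for illustration only.

It reads the claim as two rule groups. Which group applies depends on comparing an initial candidate Updated_gain against lastGain. The claim does not say where that candidate comes from, so the `candidate` parameter is an assumption (for example, the gain decoded from the corrupted frame). Other assumptions are the function name `update_gain`, the oldest-first ordering of the gain history, and keeping the candidate when no rule matches. Within each group, rules are tried in claim order and the first match wins.

```python
import random
from statistics import mean


def update_gain(candidate, gains, sub_bf, rand_var=None):
    """Sketch of the claim 32 rules for Updated_gain.

    candidate: initial Updated_gain before the rules are applied (an
        assumption; e.g. the gain decoded from the corrupted frame).
    gains: LTP gain history, assumed oldest first, at least three values.
    sub_bf: subframe order, 1 to 4.
    rand_var: randVar in [0, 1); drawn with random.random() if None.
    """
    if rand_var is None:
        rand_var = random.random()
    min_gain, max_gain = min(gains), max(gains)
    mean_gain = mean(gains)
    gain_dif = max_gain - min_gain
    last_gain = gains[-1]
    second_last_gain, third_last_gain = gains[-2], gains[-3]

    if candidate <= last_gain:
        # First group: large gain spread and lastGain is the maximum (> 0.9).
        if gain_dif > 0.5 and last_gain == max_gain > 0.9:
            if sub_bf == 1:
                return (second_last_gain + third_last_gain) / 2
            if sub_bf in (2, 4):
                return mean_gain + rand_var * (max_gain - mean_gain)
            if sub_bf == 3:
                return mean_gain - rand_var * (mean_gain - min_gain)
        return candidate  # assumption: no rule applies, keep the candidate

    # Second group: candidate exceeds lastGain; first matching rule wins.
    if gain_dif > 0.5:
        return last_gain
    if gain_dif < 0.5 and last_gain == max_gain:
        return mean_gain
    if gain_dif < 0.5:
        return last_gain
    return candidate  # gainDif == 0.5 exactly: the claim gives no rule


if __name__ == "__main__":
    gains = [0.2, 0.4, 0.95]
    print(update_gain(0.5, gains, sub_bf=1))  # approximately 0.3
    print(update_gain(1.2, gains, sub_bf=2))  # 0.95
```

Subframes 2 and 4 share one branch because the claim gives them the same formula. When gainDif is exactly 0.5, no rule in the second group fires, so the sketch keeps the candidate; the claim leaves that edge case open.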

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US702540 2000-10-31
US09/702,540 US6968309B1 (en) 2000-10-31 2000-10-31 Method and system for speech frame error concealment in speech decoding
PCT/IB2001/002021 WO2002037475A1 (en) 2000-10-31 2001-10-29 Method and system for speech frame error concealment in speech decoding

Publications (2)

Publication Number Publication Date
EP1330818A1 (en) 2003-07-30
EP1330818B1 (en) 2006-06-28

Family

ID=24821628

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01983716A Expired - Lifetime EP1330818B1 (en) 2000-10-31 2001-10-29 Method and system for speech frame error concealment in speech decoding

Country Status (14)

Country Link
US (1) US6968309B1 (en)
EP (1) EP1330818B1 (en)
JP (1) JP4313570B2 (en)
KR (1) KR100563293B1 (en)
CN (1) CN1218295C (en)
AT (1) ATE332002T1 (en)
AU (1) AU2002215138A1 (en)
BR (2) BRPI0115057B1 (en)
CA (1) CA2424202C (en)
DE (1) DE60121201T2 (en)
ES (1) ES2266281T3 (en)
PT (1) PT1330818E (en)
WO (1) WO2002037475A1 (en)
ZA (1) ZA200302556B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7821953B2 (en) * 2005-05-13 2010-10-26 Yahoo! Inc. Dynamically selecting CODECS for managing an audio message
US6885988B2 (en) * 2001-08-17 2005-04-26 Broadcom Corporation Bit error concealment methods for speech coding
US20050229046A1 (en) * 2002-08-02 2005-10-13 Matthias Marke Evaluation of received useful information by the detection of error concealment
US7634399B2 (en) * 2003-01-30 2009-12-15 Digital Voice Systems, Inc. Voice transcoder
GB2398982B (en) * 2003-02-27 2005-05-18 Motorola Inc Speech communication unit and method for synthesising speech therein
US7610190B2 (en) * 2003-10-15 2009-10-27 Fuji Xerox Co., Ltd. Systems and methods for hybrid text summarization
US7668712B2 (en) * 2004-03-31 2010-02-23 Microsoft Corporation Audio encoding and decoding with intra frames and adaptive forward error correction
US7409338B1 (en) * 2004-11-10 2008-08-05 Mediatek Incorporation Softbit speech decoder and related method for performing speech loss concealment
EP1846921B1 (en) * 2005-01-31 2017-10-04 Skype Method for concatenating frames in communication system
JP4846712B2 (en) * 2005-03-14 2011-12-28 Panasonic Corporation Scalable decoding apparatus and scalable decoding method
US7831421B2 (en) 2005-05-31 2010-11-09 Microsoft Corporation Robust decoder
US7177804B2 (en) 2005-05-31 2007-02-13 Microsoft Corporation Sub-band voice codec with multi-stage codebooks and redundant coding
US7707034B2 (en) * 2005-05-31 2010-04-27 Microsoft Corporation Audio codec post-filter
US8160874B2 (en) * 2005-12-27 2012-04-17 Panasonic Corporation Speech frame loss compensation using non-cyclic-pulse-suppressed version of previous frame excitation as synthesis filter source
KR100900438B1 (en) * 2006-04-25 2009-06-01 Samsung Electronics Co., Ltd. Apparatus and method for voice packet recovery
KR100862662B1 (en) * 2006-11-28 2008-10-10 Samsung Electronics Co., Ltd. Method and Apparatus of Frame Error Concealment, Method and Apparatus of Decoding Audio using it
CN100578618C (en) * 2006-12-04 2010-01-06 Huawei Technologies Co., Ltd. Decoding method and device
CN101226744B (en) * 2007-01-19 2011-04-13 Huawei Technologies Co., Ltd. Method and device for implementing voice decode in voice decoder
KR20080075050A (en) * 2007-02-10 2008-08-14 Samsung Electronics Co., Ltd. Method and apparatus for updating parameter of error frame
GB0703795D0 (en) * 2007-02-27 2007-04-04 Sepura Ltd Speech encoding and decoding in communications systems
US8165224B2 (en) 2007-03-22 2012-04-24 Research In Motion Limited Device and method for improved lost frame concealment
WO2008143871A1 (en) * 2007-05-15 2008-11-27 Radioframe Networks, Inc. Transporting gsm packets over a discontinuous ip based network
CN101743586B (en) * 2007-06-11 2012-10-17 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoder, encoding methods, decoder, decoding method, and encoded audio signal
CN100524462C 2007-09-15 2009-08-05 Huawei Technologies Co., Ltd. Method and apparatus for concealing frame error of high band signal
KR101525617B1 (en) * 2007-12-10 2015-06-04 Electronics and Telecommunications Research Institute Apparatus and method for transmitting and receiving streaming data using multiple path
US20090180531A1 (en) * 2008-01-07 2009-07-16 Radlive Ltd. codec with plc capabilities
WO2009152124A1 (en) * 2008-06-10 2009-12-17 Dolby Laboratories Licensing Corporation Concealing audio artifacts
KR101622950B1 (en) * 2009-01-28 2016-05-23 Samsung Electronics Co., Ltd. Method of coding/decoding audio signal and apparatus for enabling the method
US10218327B2 (en) * 2011-01-10 2019-02-26 Zhinian Jing Dynamic enhancement of audio (DAE) in headset systems
TWI626644B (en) * 2012-06-08 2018-06-11 Samsung Electronics Co., Ltd. Frame error concealment device
US9830920B2 (en) 2012-08-19 2017-11-28 The Regents Of The University Of California Method and apparatus for polyphonic audio signal prediction in coding and networking systems
US9406307B2 (en) * 2012-08-19 2016-08-02 The Regents Of The University Of California Method and apparatus for polyphonic audio signal prediction in coding and networking systems
RU2640743C1 (en) * 2012-11-15 2018-01-11 NTT Docomo, Inc. Audio encoding device, audio encoding method, audio encoding programme, audio decoding device, audio decoding method and audio decoding programme
EP2922056A1 (en) 2014-03-19 2015-09-23 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method and corresponding computer program for generating an error concealment signal using power compensation
EP2922054A1 (en) 2014-03-19 2015-09-23 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method and corresponding computer program for generating an error concealment signal using an adaptive noise estimation
EP2922055A1 (en) 2014-03-19 2015-09-23 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method and corresponding computer program for generating an error concealment signal using individual replacement LPC representations for individual codebook information
JP7266689B2 (en) * 2019-01-13 2023-04-28 Huawei Technologies Co., Ltd. High resolution audio encoding

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US5699485A (en) * 1995-06-07 1997-12-16 Lucent Technologies Inc. Pitch delay modification during frame erasures
US6188980B1 (en) * 1998-08-24 2001-02-13 Conexant Systems, Inc. Synchronized encoder-decoder frame concealment using speech coding parameters including line spectral frequencies and filter coefficients
US6453287B1 (en) * 1999-02-04 2002-09-17 Georgia-Tech Research Corporation Apparatus and quality enhancement algorithm for mixed excitation linear predictive (MELP) and other speech coders
US6377915B1 (en) * 1999-03-17 2002-04-23 Yrp Advanced Mobile Communication Systems Research Laboratories Co., Ltd. Speech decoding using mix ratio table
US7031926B2 (en) * 2000-10-23 2006-04-18 Nokia Corporation Spectral parameter substitution for the frame error concealment in a speech decoder

Non-Patent Citations (1)

Title
See references of WO0237475A1 *

Also Published As

Publication number Publication date
US6968309B1 (en) 2005-11-22
AU2002215138A1 (en) 2002-05-15
EP1330818B1 (en) 2006-06-28
DE60121201T2 (en) 2007-05-31
KR100563293B1 (en) 2006-03-22
CN1489762A (en) 2004-04-14
PT1330818E (en) 2006-11-30
WO2002037475A1 (en) 2002-05-10
CN1218295C (en) 2005-09-07
CA2424202A1 (en) 2002-05-10
ZA200302556B (en) 2004-04-05
ATE332002T1 (en) 2006-07-15
ES2266281T3 (en) 2007-03-01
KR20030086577A (en) 2003-11-10
BRPI0115057B1 (en) 2018-09-18
BR0115057A (en) 2004-06-15
DE60121201D1 (en) 2006-08-10
JP2004526173A (en) 2004-08-26
CA2424202C (en) 2009-05-19
JP4313570B2 (en) 2009-08-12

Similar Documents

Publication Publication Date Title
US6968309B1 (en) Method and system for speech frame error concealment in speech decoding
KR100718712B1 (en) Decoding device and method, and medium for providing a program
US6230124B1 (en) Coding method and apparatus, and decoding method and apparatus
EP0848374A2 (en) A method and a device for speech encoding
US10607624B2 (en) Signal codec device and method in communication system
EP1224663B1 (en) A predictive speech coder using coding scheme selection patterns to reduce sensitivity to frame errors
EP2798631B1 (en) Adaptively encoding pitch lag for voiced speech
EP1617417A1 (en) Voice coding/decoding method and apparatus
JP3464371B2 (en) Improved method of generating comfort noise during discontinuous transmission
EP1020848A2 (en) Method for transmitting auxiliary information in a vocoder stream
JPH1022937A (en) Error compensation device and recording medium
US7584096B2 (en) Method and apparatus for encoding speech
JP4437052B2 (en) Speech decoding apparatus and speech decoding method
JP4597360B2 (en) Speech decoding apparatus and speech decoding method
JP3519764B2 (en) Speech coding communication system and its device
KR20010113780A (en) Error correction method with pitch change detection
EP1527440A1 (en) Speech communication unit and method for error mitigation of speech frames
AU2002210799B2 (en) Improved spectral parameter substitution for the frame error concealment in a speech decoder
KR20100112128A (en) Processing of binary errors in a digital audio binary frame

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20030402

AK Designated contracting states

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

17Q First examination report despatched

Effective date: 20050324

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20060628

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; WARNING: LAPSES OF ITALIAN PATENTS WITH EFFECTIVE DATE BEFORE 2007 MAY HAVE OCCURRED AT ANY TIME BEFORE 2007. THE CORRECT EFFECTIVE DATE MAY BE DIFFERENT FROM THE ONE RECORDED.

Effective date: 20060628

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20060628

Ref country code: BE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20060628

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: CH

Ref legal event code: NV

Representative's name: E. BLUM & CO. PATENTANWAELTE

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REF Corresponds to:

Ref document number: 60121201

Country of ref document: DE

Date of ref document: 20060810

Kind code of ref document: P

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20060928

REG Reference to a national code

Ref country code: SE

Ref legal event code: TRGR

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20061031

Ref country code: MC

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20061031

REG Reference to a national code

Ref country code: PT

Ref legal event code: SC4A

Free format text: AVAILABILITY OF NATIONAL TRANSLATION

Effective date: 20060927

ET Fr: translation filed
REG Reference to a national code

Ref country code: ES

Ref legal event code: FG2A

Ref document number: 2266281

Country of ref document: ES

Kind code of ref document: T3

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20070329

REG Reference to a national code

Ref country code: CH

Ref legal event code: PFA

Owner name: NOKIA CORPORATION

Free format text: NOKIA CORPORATION, KEILALAHDENTIE 4, 02150 ESPOO (FI) -TRANSFER TO- NOKIA CORPORATION, KEILALAHDENTIE 4, 02150 ESPOO (FI)

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20060929

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20061029

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20060628

REG Reference to a national code

Ref country code: GB

Ref legal event code: 732E

Free format text: REGISTERED BETWEEN 20150910 AND 20150916

REG Reference to a national code

Ref country code: DE

Ref legal event code: R082

Ref document number: 60121201

Country of ref document: DE

Representative's name: BECKER, KURIG, STRAUS, DE

Ref country code: DE

Ref legal event code: R081

Ref document number: 60121201

Country of ref document: DE

Owner name: NOKIA TECHNOLOGIES OY, FI

Free format text: FORMER OWNER: NOKIA CORP., 02610 ESPOO, FI

REG Reference to a national code

Ref country code: ES

Ref legal event code: PC2A

Owner name: NOKIA TECHNOLOGIES OY

Effective date: 20151124

REG Reference to a national code

Ref country code: PT

Ref legal event code: PC4A

Owner name: NOKIA TECHNOLOGIES OY, FI

Effective date: 20151127

REG Reference to a national code

Ref country code: CH

Ref legal event code: PUE

Owner name: NOKIA TECHNOLOGIES OY, FI

Free format text: FORMER OWNER: NOKIA CORPORATION, FI

REG Reference to a national code

Ref country code: NL

Ref legal event code: PD

Owner name: NOKIA TECHNOLOGIES OY; FI

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), ASSIGNMENT; FORMER OWNER NAME: NOKIA CORPORATION

Effective date: 20151111

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 16

REG Reference to a national code

Ref country code: FR

Ref legal event code: TP

Owner name: NOKIA TECHNOLOGIES OY, FI

Effective date: 20170109

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 17

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 18

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20200914

Year of fee payment: 20

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: TR

Payment date: 20201022

Year of fee payment: 20

Ref country code: NL

Payment date: 20201015

Year of fee payment: 20

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: PT

Payment date: 20201029

Year of fee payment: 20

Ref country code: DE

Payment date: 20201013

Year of fee payment: 20

Ref country code: IT

Payment date: 20200911

Year of fee payment: 20

Ref country code: CH

Payment date: 20201015

Year of fee payment: 20

Ref country code: ES

Payment date: 20201103

Year of fee payment: 20

Ref country code: GB

Payment date: 20201021

Year of fee payment: 20

Ref country code: SE

Payment date: 20201012

Year of fee payment: 20

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

Ref country code: DE

Ref legal event code: R071

Ref document number: 60121201

Country of ref document: DE

REG Reference to a national code

Ref country code: NL

Ref legal event code: MK

Effective date: 20211028

REG Reference to a national code

Ref country code: GB

Ref legal event code: PE20

Expiry date: 20211028

REG Reference to a national code

Ref country code: SE

Ref legal event code: EUG

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF EXPIRATION OF PROTECTION

Effective date: 20211028

Ref country code: PT

Free format text: LAPSE BECAUSE OF EXPIRATION OF PROTECTION

Effective date: 20211108

REG Reference to a national code

Ref country code: ES

Ref legal event code: FD2A

Effective date: 20220204

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: ES

Free format text: LAPSE BECAUSE OF EXPIRATION OF PROTECTION

Effective date: 20211030

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211029