WO2008049221A1 - Method and device for coding transition frames in speech signals

Publication number: WO2008049221A1
Authority: WIPO (PCT)
Prior art keywords: transition, codebook, frame, glottal, transition mode
Application number: PCT/CA2007/001896
Other languages: French (fr)
Inventors: Vaclav Eksler, Milan Jelinek, Redwan Salami
Original assignee: Voiceage Corporation
Application filed by Voiceage Corporation. Family applications include: MX2009004427A, DK07816046.2T, ES07816046.2T, KR1020097010701A (KR101406113B1), CN2007800480774A (CN101578508B), CA2666546A (CA2666546C), JP2009533622A (JP5166425B2), BRPI0718300-3A (BRPI0718300B1), US12/446,892 (US8401843B2), EP07816046.2A (EP2102619B1), NO20092017A (NO341585B1), HK09112127.2A (HK1132324A1).

Classifications

    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/09 Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
    • G10L19/12 Determination or coding of the excitation function, the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Definitions

  • the present invention relates to a technique for digitally encoding a sound signal, for example a speech or audio signal, in view of transmitting and synthesizing this sound signal.
  • the present invention relates to a method and device for encoding transition frames and frames following the transition in a sound signal, for example a speech or audio signal, in order to reduce error propagation at the decoder in case of frame erasure and/or to enhance coding efficiency, mainly at the beginning of voiced segments (onset frames).
  • the method and device replace the adaptive codebook typically used in predictive encoders by a codebook of, for example, glottal impulse shapes in transition frames and in frames following the transition.
  • the glottal-shape codebook can be a fixed codebook independent of the past excitation whereby, once the frame erasure is over, the encoder and the decoder use the same excitation so that convergence to clean-channel synthesis is quite rapid.
  • the past excitation buffer is updated using the noise-like excitation of the previous unvoiced or inactive frame that is very different from the current excitation.
  • the proposed technique can build the periodic part of the excitation very accurately.
  • a speech encoder converts a speech signal into a digital bit stream which is transmitted over a communication channel or stored in a storage medium.
  • the speech signal is digitized, that is, sampled and quantized, usually with 16 bits per sample.
  • the speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality.
  • the speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a speech signal.
  • CELP (Code-Excited Linear Prediction) is one such predictive coding technique.
  • in CELP coding, the sampled speech signal is processed in successive blocks of M samples usually called frames, where M is a predetermined number corresponding typically to 10-30 ms.
  • a linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, a 5-15 ms speech segment from the subsequent frame.
  • the M-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4-10 ms subframes.
  • an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation.
  • the component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation.
  • the parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
  • CELP-type speech codecs rely heavily on prediction to achieve their high performance.
  • the prediction used can be of different kinds but usually comprises the use of an adaptive codebook containing an excitation signal selected in past frames.
  • a CELP encoder exploits the quasi periodicity of voiced speech signal by searching in the past excitation the segment most similar to the segment being currently encoded. The same past excitation signal is maintained also in the decoder. It is then sufficient for the encoder to send a delay parameter and a gain for the decoder to reconstruct the same excitation signal as is used in the encoder.
  • the evolution (difference) between the previous speech segment and the currently encoded speech segment is further modeled using an innovation selected from a fixed codebook.
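As a rough illustration of how these two contributions combine, the following Python sketch (hypothetical names and structure, not the patent's implementation) rebuilds one subframe of excitation from a decoded delay, pitch gain and innovation vector:

    import numpy as np

    def decode_subframe_excitation(past_exc, t, g_p, c_k, g_c):
        """Total CELP excitation u = g_p*v + g_c*c for one subframe."""
        c_k = np.asarray(c_k, dtype=float)
        n = len(c_k)
        v = np.empty(n)
        for i in range(n):
            # adaptive codevector: copy of the excitation t samples in the past;
            # when t < n, the freshly built samples are reused (long-term prediction)
            v[i] = past_exc[-t + i] if i < t else v[i - t]
        u = g_p * v + g_c * c_k
        return u, np.concatenate((past_exc, u))   # excitation and updated buffer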
  • the CELP technology will be described in more detail herein below.
  • a problem of the strong prediction inherent in CELP-based speech coders appears in the presence of transmission errors (erased frames or packets), when the states of the encoder and the decoder become desynchronized. Due to the prediction, the effect of an erased frame is thus not limited to the erased frame, but continues to propagate after the erasure, often during several following frames. Naturally, the perceptual impact can be very annoying.
  • Transitions from unvoiced speech segment to voiced speech segment are the most problematic cases for frame erasure concealment.
  • When a transition from an unvoiced speech segment to a voiced speech segment (voiced onset) is lost, the frame right before the voiced onset frame is unvoiced or inactive, and thus no meaningful periodic excitation is found in the buffer of the past excitation (adaptive codebook).
  • the past periodic excitation builds up in the adaptive codebook during the onset frame, and the following voiced frame is encoded using this past periodic excitation.
  • An object of the present invention is therefore to provide a method and device for encoding transition frames in a predictive speech and/or audio encoder in order to improve the encoder robustness against lost frames and/or improve the coding efficiency.
  • Another object of the present invention is to eliminate error propagation and increase coding efficiency in CELP-based codecs by replacing the inter-frame dependent adaptive codebook search by a non-predictive, for example glottal-shape, codebook search.
  • This technique requires no extra delay, negligible additional complexity, and no increase in bit rate compared to traditional CELP encoding.
  • a transition mode method for use in a predictive-type sound signal codec for producing a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in the sound signal comprising: providing a transition mode codebook for generating a set of codevectors independent from past excitation; supplying a codebook index to the transition mode codebook; and generating, by means of the transition mode codebook and in response to the codebook index, one of the codevectors of the set corresponding to the transition mode excitation.
  • a transition mode device for use in a predictive-type sound signal codec for producing a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in the sound signal, comprising an input for receiving a codebook index and a transition mode codebook for generating a set of codevectors independent from past excitation.
  • the transition mode codebook is responsive to the index for generating, in the transition frame and/or frame following the transition, one of the codevectors of the set corresponding to said transition mode excitation.
  • an encoding method for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal comprising: generating a codebook search target signal; providing a transition mode codebook for generating a set of codevectors independent from past excitation, the codevectors of the set each corresponding to a respective transition mode excitation; searching the transition mode codebook for finding the codevector of the set corresponding to a transition mode excitation optimally corresponding to the codebook search target signal.
  • an encoder device for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: a generator of a codebook search target signal; a transition mode codebook for generating a set of codevectors independent from past excitation, the codevectors of the set each corresponding to a respective transition mode excitation; and a searcher of the transition mode codebook for finding the codevector of the set corresponding to a transition mode excitation optimally corresponding to the codebook search target signal.
  • a decoding method for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal comprising: receiving a codebook index; supplying the codebook index to a transition mode codebook for generating a set of codevectors independent from past excitation; and generating, by means of the transition mode codebook and in response to the codebook index, one of the codevectors of the set corresponding to the transition mode excitation.
  • a decoder device for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising an input for receiving a codebook index and a transition mode codebook for generating a set of codevectors independent from past excitation.
  • the transition mode codebook is responsive to the index for generating in the transition frame and/or frame following the transition one of the codevectors of the set corresponding to the transition mode excitation.
  • Figure 1a is a schematic block diagram of a CELP-based encoder;
  • Figure 1b is a schematic block diagram of a CELP-based decoder;
  • Figure 2 is a schematic block diagram of a frame classification state machine for erasure concealment;
  • Figure 3 is an example of segment of a speech signal with one voiced transition frame and one onset frame;
  • Figure 4 is a functional block diagram illustrating a classification rule to select TM (Transition Mode) frames in speech onsets, where N_TM_FRAMES stands for the number of consecutive frames beyond which the TM coding technique is no longer used, 'clas' stands for a frame class, and VOICED_TYPE means the ONSET, VOICED and VOICED TRANSITION classes;
  • Figure 5a is a schematic illustration of an example of frame of a speech signal divided into four (4) subframes, showing the speech signal in the time domain;
  • Figure 5b is a schematic illustration of an example of frame of a speech signal divided into four (4) subframes, showing an LP residual signal;
  • Figure 5c is a schematic illustration of an example of frame of a speech signal divided into four (4) subframes, showing a first stage excitation signal constructed using the TM coding technique in the encoder;
  • Figure 6 shows graphs illustrating eight glottal impulses of 17-sample length used for the glottal-shape codebook construction, wherein the x-axis denotes a discrete time index and the y-axis the amplitude of the impulse;
  • Figure 7 is a schematic block diagram of an example of TM portion of a CELP encoder, where k' represents a glottal-shape codebook index and G(z) is a shaping filter;
  • Figure 8 is a graphical representation of the computation of C_k', the square root of the numerator in the criterion of Equation (16), wherein shaded portions of the vector/matrix are non-zero;
  • Figure 9 is a graphical representation of the computation of E_k', the denominator of the criterion of Equation (16), wherein shaded portions of the vector/matrix are non-zero;
  • Figure 10 is a graphical representation of the computation of the matrix Z;
  • Figure 11 is a schematic block diagram of an example of TM portion of a CELP decoder;
  • Figure 12a is a schematic block diagram of an example of structure of the filter Q(z);
  • Figure 12b is a graph of an example of glottal-shape codevector modification, wherein the repeated impulse is dotted;
  • Figure 13 is a schematic block diagram of the TM portion of a CELP encoder including the filter Q(z);
  • Figure 14 is a graph illustrating a glottal-shape codevector with two-impulses construction when an adaptive codebook search is used in a part of the subframe with a glottal-shape codebook search;
  • Figure 15 is a graph illustrating a glottal-shape codevector construction in the case where the second glottal impulse appears in the first L_1/2 positions of the next subframe;
  • Figure 16 is a schematic block diagram of the TM portion of an encoder used in an EV-VBR (Embedded Variable Bit Rate) codec implementation;
  • Figure 17a is a graph showing an example of a speech signal in the time domain;
  • Figure 17b is a graph showing the LP residual signal corresponding to the speech signal of Figure 17a;
  • Figure 17c is a graph showing a first-stage excitation signal in error-free conditions;
  • Figures 18a-18c are graphs illustrating an example of onset construction comparison, wherein the graph of Figure 18a represents the input speech signal, the graph of Figure 18b represents the output synthesized speech of an EV-VBR codec without the TM coding technique, and the graph of Figure 18c represents the output synthesized speech of an EV-VBR codec with the TM coding technique;
  • Figures 19a-19c are graphs illustrating an example of the effect of the TM coding technique in the case of frame erasure, wherein the graph of Figure 19a represents the input speech signal, the graph of Figure 19b represents the output synthesized speech of an EV-VBR codec without the TM coding technique, and the graph of Figure 19c represents the output synthesized speech of an EV-VBR codec with the TM coding technique;
  • Figure 20 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_1;
  • Figure 21 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_2;
  • Figure 22 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_3;
  • Figure 23 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_4;
  • Figure 24 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_2;
  • Figure 25 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_3;
  • Figure 26 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_4;
  • Figure 27 is a schematic block diagram of a speech communication system illustrating the use of speech encoding and decoding devices.
  • the non-restrictive illustrative embodiment of the present invention is concerned with a method and device whose purpose is to overcome error propagation in the above described situations and increase the coding efficiency.
  • the method and device implement a special encoding, called transition mode (TM) encoding technique, of transition frames and frames following the transition in a sound signal, for example a speech or audio signal.
  • the TM coding technique replaces the adaptive codebook of the CELP codec by a new codebook of glottal impulse shapes, hereinafter designated as glottal-shape codebook, in transition frames and in frames following the transition.
  • the glottal-shape codebook is a fixed codebook independent of the past excitation. Consequently, once a frame erasure is over, the encoder and the decoder use the same excitation whereby convergence to clean-channel synthesis is quite rapid.
  • the use of the TM coding technique in frames following a transition helps to prevent error propagation in the case where the transition frame is lost.
  • another purpose of using the TM coding technique also in the transition frame is to improve the coding efficiency.
  • the adaptive codebook usually contains a noise-like signal not very efficient for encoding the beginning of a voiced segment.
  • the idea behind the TM coding technique is thus to supplement the adaptive codebook with a better codebook populated with simplified quantized versions of glottal impulses to encode the voiced onsets.
  • the proposed TM coding technique can be used in any CELP- type codec or predictive codec.
  • the TM coding technique is implemented in a candidate codec in the ITU-T standardization activity for an Embedded Variable Bit Rate codec, referred to in the remainder of the text as the EV-VBR codec.
  • while the non-restrictive illustrative embodiment of the present invention will be described in connection with a speech signal, it should be kept in mind that the present invention is not limited to an application to speech signals; its principles and concepts can be applied to any other type of sound signal, including audio signals.
  • a speech frame can be roughly classified into one of the four (4) following speech classes (this will be explained in more detail in the following description):
  • Unvoiced speech frames characterized by an aperiodic structure and energy concentration toward higher frequencies
  • FIG. 27 is a schematic block diagram of a speech communication system depicting the use of speech encoding and decoding.
  • the speech communication system supports transmission and reproduction of a speech signal across a communication channel 905.
  • the communication channel 905 may comprise, for example, a wire, an optical link or a fiber link.
  • the communication channel 905 typically comprises at least in part a radio frequency link.
  • the radio frequency link often supports multiple, simultaneous speech communications requiring shared bandwidth resources such as may be found with cellular telephony.
  • the communication channel 905 may be replaced by a storage device in a single device embodiment of the communication system that records and stores the encoded speech signal for later playback.
  • a microphone 901 produces an analog speech signal that is supplied to an analog-to-digital (A/D) converter 902 for converting it into a digital form.
  • a speech encoder 903 encodes the digital speech signal thereby producing a set of encoding parameters that are coded into a binary form and delivered to a channel encoder 904.
  • the optional channel encoder adds redundancy to the binary representation of the coding parameters before transmitting them over the communication channel 905.
  • a channel decoder 906 utilizes the above mentioned redundant information in the received bit stream to detect and correct channel errors that have occurred in the transmission.
  • a speech decoder 907 converts the bit stream received from the channel decoder 906 back to a set of encoding parameters for creating a synthesized digital speech signal.
  • the synthesized digital speech signal reconstructed in the speech decoder 907 is converted to an analog form in a digital- to-analog (D/A) converter 908 and played back in a loudspeaker unit 909.
  • a speech codec consists of two basic parts: an encoder and a decoder.
  • the encoder digitizes the audio signal, chooses a limited number of encoding parameters representing the speech signal and converts these parameters into a digital bit stream that is transmitted to the decoder through a communication channel.
  • the decoder reconstructs the speech signal to be as similar as possible to the original speech signal.
  • in CELP coding, the speech signal is synthesized by filtering an excitation signal through an all-pole synthesis filter 1/A(z).
  • the excitation is typically composed of two parts: a first stage excitation signal selected from an adaptive codebook and a second stage excitation signal selected from a fixed codebook.
  • the adaptive codebook excitation models the periodic part of the excitation and the fixed codebook excitation is added to model the evolution of the speech signal.
  • the speech is normally processed by frames of typically 20 ms and the LP filter coefficients are transmitted once per frame.
  • every frame is further divided into several subframes to encode the excitation signal.
  • the subframe length is typically 5 ms.
  • the main principle behind CELP is called Analysis-by-Synthesis where possible decoder outputs are tried (synthesis) already during the encoding process (analysis) and then compared to the original speech signal.
  • the perceptual weighting filter W(z) exploits the frequency masking effect and is typically derived from the LP filter.
  • An example of perceptual weighting filter W(z) is given in the following Equation (1):

    W(z) = A(z/γ_1) / A(z/γ_2)    (1)

where the factors γ_1 and γ_2 control the amount of perceptual weighting and satisfy the relation 0 < γ_2 < γ_1 ≤ 1.
  • This traditional perceptual weighting filter works well for NB (narrowband - bandwidth of 200 - 3400 Hz) signals.
  • An example of perceptual weighting filter for WB (wideband - bandwidth of 50 - 7000 Hz) signals can be found in Reference [1].
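The weighting in Equation (1) amounts to scaling the LP coefficients, as in this minimal Python sketch (the γ defaults are illustrative values, not taken from the patent):

    import numpy as np

    def weighting_filter_coeffs(a, gamma1=0.92, gamma2=0.68):
        """Numerator/denominator of W(z) = A(z/gamma1)/A(z/gamma2).

        a holds [1, a1, ..., ap]; A(z/g) has coefficients a_i * g**i.
        """
        a = np.asarray(a, dtype=float)
        k = np.arange(len(a))
        return a * gamma1 ** k, a * gamma2 ** k

A signal s can then be weighted with, e.g., scipy.signal.lfilter(num, den, s).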
  • the bit stream transmitted to the decoder contains for the voiced frames the following encoding parameters: the quantized parameters of the LP synthesis filter, the adaptive and fixed codebook indices and the gains of the adaptive and fixed parts.
  • the adaptive codebook search in CELP-based codecs is performed in the weighted speech domain to determine the delay (pitch period) t and the pitch gain g_p, and to construct the quasi-periodic part of the excitation signal referred to as the adaptive codevector v(n).
  • the pitch period is strongly dependent on the particular speaker and its accurate determination critically influences the quality of the synthesized speech.
  • a three-stage procedure is used to determine the pitch period and gain.
  • three open-loop pitch estimates T_op are computed for each frame - one estimate for each 10 ms half-frame and one for the 10 ms lookahead - using the perceptually weighted speech signal s_w(n) and normalized correlation computation.
  • a closed- loop pitch search is performed for integer periods around the estimated open-loop pitch periods T op for every subframe. Once an optimum integer pitch period is found, a third search stage goes through the fractions around that optimum integer value.
  • the closed-loop pitch search is performed by minimizing the mean-squared weighted error between the original and synthesized speech. This is achieved by maximizing the term:

    (x_1^T y_1)^2 / (y_1^T y_1)    (2)

where x_1(n) is the target signal and the first stage contribution signal (also called the filtered adaptive codevector) y_1(n) is computed as the convolution of the past excitation signal v(n) at period t with the impulse response h(n) of the weighted synthesis filter H(z):

    y_1(n) = v(n) * h(n)    (3)
  • the perceptually weighted input speech signal s_w(n) is obtained by processing the input speech signal s(n) through the perceptual weighting filter W(z).
  • the filter H(z) is formed by the cascade of the LP synthesis filter 1/A(z) and the perceptual weighting filter W(z).
  • the target signal x_1(n) corresponds to the perceptually weighted input speech signal s_w(n) after subtracting therefrom the zero-input response of the filter H(z).
  • the pitch gain is found by minimizing the mean-squared error between the signal x_1(n) and the first stage contribution signal y_1(n).
  • the pitch gain is expressed by the following Equation:

    g_p = (x_1^T y_1) / (y_1^T y_1)    (4)

  • the pitch gain is then bounded by 0 ≤ g_p ≤ 1.2 and typically jointly quantized with the fixed codebook gain once the innovation is found.
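A minimal integer-delay version of the closed-loop stage might look as follows in Python (fractional refinement and the open-loop stage are omitted; the names are ours):

    import numpy as np

    def adaptive_codevector(past_exc, t, n):
        """Adaptive codevector at integer delay t (repeats itself if t < n)."""
        v = np.empty(n)
        for i in range(n):
            v[i] = past_exc[-t + i] if i < t else v[i - t]
        return v

    def closed_loop_pitch(x1, past_exc, h, t_min, t_max):
        """Maximize (x1.y1)^2 / (y1.y1) over t; return delay and gain."""
        n = len(x1)
        best_t, best_gp, best_c = t_min, 0.0, -np.inf
        for t in range(t_min, t_max + 1):
            y1 = np.convolve(adaptive_codevector(past_exc, t, n), h)[:n]
            num, den = np.dot(x1, y1), np.dot(y1, y1) + 1e-12
            if num * num / den > best_c:
                best_c = num * num / den
                # Equation (4), bounded by 0 <= g_p <= 1.2
                best_t, best_gp = t, min(max(num / den, 0.0), 1.2)
        return best_t, best_gp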
  • the excitation signal in the beginning of the currently processed frame is thus reconstructed from the excitation signal from the previous frame. This mechanism is very efficient for voiced segments of the speech signal where the signal is quasi-periodic, and in absence of transmission errors.
  • however, when a frame erasure occurs, the excitation signal from the previous frame is lost and the respective adaptive codebooks of the encoder and decoder are no longer the same.
  • the decoder then continues to synthesize the speech using the adaptive codebook with incorrect content.
  • a frame erasure degrades the synthesized speech quality not only during the erased frame, but it can also degrade the synthesized speech quality during several subsequent frames.
  • the traditional concealment techniques are often based on repeating the waveform of the previous correctly-transmitted frame, but these techniques work efficiently only in the signal parts where the characteristics of the speech signal are quasi-stationary, for example in stable voiced segments. In this case, the difference between the respective adaptive codebooks of the encoder and decoder is often quite small and the quality of the synthesized signal is not much affected. However, if the erasure falls in a transition frame, the efficiency of these techniques is very limited. In communication systems using CELP-based codecs, where the Frame Erasure Rate (FER) is typically 3% to 5%, the synthesized speech quality then drops significantly.
  • the CELP encoder makes use of the adaptive codebook to exploit the periodicity in speech; this periodicity is low or missing during transitions, whereby the coding efficiency drops. This is the case in particular for voiced onsets, where the past excitation signal and the optimal excitation signal for the current frame are correlated very weakly or not at all.
  • the goal of the Fixed CodeBook (FCB) search in CELP-based codecs is to minimize the residual error after the use of the adaptive codebook, i.e. to minimize:

    E = || x_2 - g_c H c_k ||^2

where g_c is the fixed codebook gain, H is the lower triangular Toeplitz convolution matrix of the filter H(z), and the second stage contribution signal (also called the filtered fixed codevector) is the fixed codebook vector c_k(n) convolved with h(n).
  • the target signal x_2(n) is updated by subtracting the adaptive codebook contribution from the adaptive codebook target to obtain:

    x_2(n) = x_1(n) - g_p y_1(n)
  • the fixed codebook can be realized for example by using an algebraic codebook as described in Reference [2]. If c_k denotes the algebraic codevector at index k, then the algebraic codebook is searched by maximizing the following criterion:

    Q_k = (x_2^T H c_k)^2 / (c_k^T H^T H c_k) = (d^T c_k)^2 / (c_k^T Φ c_k)

where d = H^T x_2 is the backward-filtered target vector and Φ = H^T H is the correlation matrix; H is the lower triangular Toeplitz convolution matrix with diagonal h(0) and lower diagonals h(1), ..., h(N-1). The superscript T denotes matrix or vector transpose. Both d and Φ are usually computed prior to the fixed codebook search.
  • Reference [1] discusses that, if the algebraic structure of the fixed codebook contains only a few non-zero elements, a computation of the maximization criterion for all possible indexes k is very fast. A similar procedure is used in the transition mode (TM) encoding technique as will be seen below.
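The fast evaluation alluded to above can be sketched as follows: with d and Φ precomputed, scoring a sparse codevector touches only a handful of entries (a simplified illustration under our own naming, not the reference's code):

    import numpy as np

    def precompute(x2, h):
        """d = H^T x2 and Phi = H^T H for the algebraic codebook search."""
        n = len(x2)
        H = np.zeros((n, n))
        for i in range(n):
            H[i:, i] = h[:n - i]        # lower triangular Toeplitz, diagonal h(0)
        return H.T @ x2, H.T @ H

    def score_pulses(d, Phi, positions, signs):
        """Criterion (d^T c)^2 / (c^T Phi c) for a few signed unit pulses."""
        num = sum(s * d[p] for p, s in zip(positions, signs)) ** 2
        den = sum(si * sj * Phi[pi, pj]
                  for pi, si in zip(positions, signs)
                  for pj, sj in zip(positions, signs))
        return num / den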
  • CELP is believed to be otherwise well known to those of ordinary skill in the art and, for that reason, will not be further described in the present specification.
  • the frame classification in the EV-VBR codec is based on the VMR-WB classification, which is done with consideration of the concealment and recovery strategy. In other words, any frame is classified in such a way that the concealment can be optimal if the following frame is missing, or that the recovery can be optimal if the previous frame was lost.
  • Some of the classes used for frame erasure concealment processing need not be transmitted, as they can be deduced without ambiguity at the decoder. Five distinct classes are used, and defined as follows:
  • UNVOICED class comprises all unvoiced speech frames and all frames without active speech.
  • a voiced offset frame can also be classified as UNVOICED if its end tends to be unvoiced, so that the concealment designed for unvoiced frames can be used for the following frame in case it is lost.
  • - UNVOICED TRANSITION class comprises unvoiced frames with a possible voiced onset at the end. The voiced onset is however still too short or not built well enough to use the concealment designed for voiced frames.
  • An UNVOICED TRANSITION frame can follow only a frame classified as UNVOICED or UNVOICED TRANSITION.
  • - VOICED TRANSITION class comprises voiced frames with relatively weak voiced characteristics. Those are typically voiced frames with rapidly changing characteristics (transitions between vowels) or voiced offsets lasting the whole frame.
  • a VOICED TRANSITION frame can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.
  • - VOICED class comprises voiced frames with stable characteristics.
  • a VOICED frame can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.
  • - ONSET class comprises all voiced frames with stable characteristics following a frame classified as UNVOICED or UNVOICED TRANSITION.
  • Frames classified as ONSET correspond to voiced onset frames where the onset is already sufficiently well built for the use of the concealment designed for lost voiced frames.
  • in traditional CELP-based codecs, the concealment techniques used for a frame erasure following a frame classified as ONSET are the same as those following a frame classified as VOICED; the difference lies in the recovery strategy, where a special technique can be used to artificially reconstruct the lost onset.
  • the TM coding technique is successfully used in this case.
  • the classification state diagram is outlined in Figure 2.
  • the classification information is transmitted using 2 bits.
  • the UNVOICED TRANSITION class and VOICED TRANSITION class can be grouped together as they can be unambiguously differentiated at the decoder (an UNVOICED TRANSITION frame can follow only UNVOICED or UNVOICED TRANSITION frames, a VOICED TRANSITION frame can follow only ONSET, VOICED or VOICED TRANSITION frames).
  • the computation of these parameters uses a lookahead.
  • the lookahead allows the evolution of the speech signal in the following frame to be estimated and, consequently, the classification can be done by taking into account the future speech signal behaviour.
  • the maximum normalized correlations C_norm are computed as a part of the open-loop pitch search and correspond to the maximized normalized correlations of two adjacent pitch periods of the weighted speech signal.
  • the spectral tilt parameter e_t contains the information about the frequency distribution of energy.
  • the spectral tilt for one spectral analysis is estimated as a ratio between the energy concentrated in low frequencies and the energy concentrated in high frequencies.
  • the tilt measure used is the average, in the logarithmic domain, of the spectral tilt measures e_tilt(0) and e_tilt(1), each defined as a ratio of the low-frequency to the high-frequency energy. That is:

    e_t = 0.5 [ log(e_tilt(0)) + log(e_tilt(1)) ]
  • T_op0, T_op1 and T_op2 correspond to the open-loop pitch estimates from the first half of the current frame, the second half of the current frame and the lookahead, respectively.
  • the relative frame energy E_rel is computed as the difference in dB between the current frame energy and the long-term active-speech energy average.
  • the last parameter is the zero-crossing parameter zc computed on a 20 ms segment of the speech signal.
  • the segment starts in the middle of the current frame and uses two subframes of the lookahead.
  • the zero-crossing counter zc counts the number of times the speech signal sign changes from positive to negative during that interval.
  • the classification parameters are considered together forming a function of merit f m .
  • the classification parameters are first scaled between 0 and 1 so that each parameter's value typical for an unvoiced speech signal translates into 0 and each parameter's value typical for a voiced speech signal translates into 1. A linear function is used between them.
  • the scaled version p_s of a certain parameter p_x is obtained using the Equation:

    p_s = k_p · p_x + c_p

constrained between 0 and 1, where k_p and c_p are parameter-specific constants.
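A literal rendering of this scaling step (the per-parameter constants are design values not given here):

    def scale_parameter(p, k, c):
        """p_s = k*p + c, clipped to [0, 1]; k and c are chosen per parameter
        so typical unvoiced values map near 0 and voiced values near 1."""
        return min(max(k * p + c, 0.0), 1.0)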
  • a first classification decision is made for the UNVOICED class using the scaled classification parameters.
  • the class information is encoded with two bits as explained herein above.
  • although the supplementary information which improves frame erasure concealment is transmitted only in Generic frames, the classification is performed for each frame. This is needed to maintain the classification state machine up to date, as it uses the information about the class of the previous frame.
  • the classification is however straightforward for encoding types dedicated to UNVOICED or VOICED frames. Hence, voiced frames are always classified as VOICED and unvoiced frames are always classified as UNVOICED.
  • the technique being described replaces the adaptive codebook in CELP-based coders by a glottal-shape codebook to improve the robustness to frame erasures and to enhance the coding efficiency when non-stationary speech frames are processed.
  • This technique does not construct the first stage excitation signal with the use of the past excitation, but selects the first stage excitation signal from the glottal-shape codebook.
  • the second stage excitation signal (the innovation part of the total excitation) is still selected from the traditional CELP fixed codebook. Neither of these codebooks uses information from past (previously transmitted) speech frames, thereby eliminating the main reason for frame error propagation inherent in CELP-based encoders.
  • the TM coding technique can be applied only to the transition frames and to several frames following each transition frame.
  • the TM coding technique can be used for voiced speech frames following transitions. As introduced previously, these transitions comprise basically the voiced onsets and the transitions between two different voiced sounds.
  • transitions are detected. While any detector of transitions can be used, the non-restrictive illustrative embodiment uses the classification of the EV- VBR framework as described herein above.
  • the TM coding technique can thus be applied to encode transition frames and/or frames following them (TM frames: frames encoded using the TM coding technique).
  • the number of TM frames (frames encoded using the TM coding technique) is a matter of compromise between the codec performance in clean-channel conditions and in conditions with channel errors. If only the transition (voiced onset or transition between two different voiced sounds) frames are encoded using the TM coding technique, the encoding efficiency increases. This increase can be measured by the increase of the segmental signal-to-noise ratio (SNR), for example.
  • the segmental SNR is computed for each frame as:

    SNR = 10 log10( E_sd / E_e )

where E_sd is the energy of the input speech signal of the current frame and E_e is the energy of the error between this input speech signal and the synthesized speech signal of the current frame.
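The measure can be sketched as follows (our reading of the definition above; the frame length and numerical floors are illustrative):

    import numpy as np

    def segmental_snr(s, s_hat, frame=320):
        """Mean over frames of 10*log10(E_sd / E_e), in dB."""
        snrs = []
        for i in range(0, len(s) - frame + 1, frame):
            e_s = float(np.sum(np.square(s[i:i + frame])))
            e_e = float(np.sum(np.square(s[i:i + frame] - s_hat[i:i + frame])))
            snrs.append(10.0 * np.log10((e_s + 1e-12) / (e_e + 1e-12)))
        return float(np.mean(snrs))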
  • using the TM coding technique to encode only the transition frames does not help much for error robustness; if the transition (voiced onset or transition between two different voiced sounds) frame is lost, the error will still propagate, as the following frames would be coded using the standard CELP procedure. On the other hand, if the frame preceding the transition frame is lost, the effect of this lost preceding frame on the performance is not critical even without the use of the TM coding technique. In the case of voiced onset transitions, the frame preceding the onset is likely to be unvoiced and the adaptive codebook contribution is not very important. In the case of a transition between two voiced sounds, the frame before the transition is generally fairly stationary and the adaptive codebook states in the encoder and the decoder are often similar after the frame erasure.
  • the TM coding technique can be used only in the frames following the transition frames. Basically, the number of consecutive TM frames depends on the number of consecutive frame erasures one wants to consider for protection. If only isolated erasures are considered (i.e. one isolated frame erasure at a time), it is sufficient to encode only the frame following the transition (voiced onset or transition between two different voiced sounds) frame. If the transition (voiced onset or transition between two different voiced sounds) frame is lost, the following frame is encoded without the use of the past excitation signal and the error propagation is broken.
  • the following scheme to set the onset and the following frames for TM coding can be used.
  • a parameter state, which is a counter of the consecutive TM frames previously used, is stored in the encoder state memory. If the value of this parameter state is negative, TM coding cannot be used. If the parameter state is not negative but lower than or equal to the number of consecutive frame erasures to protect, and the class of the frame is ONSET, VOICED or VOICED TRANSITION, the frame is denoted as a TM frame (see Figure 4 for more detail, and the sketch below). In other words, the frame is denoted as a TM frame if N_TM_FRAMES > state ≥ 0, where N_TM_FRAMES is the number of consecutive frames beyond which the TM coding technique is no longer used.
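The selection rule of Figure 4 can be sketched as a small state machine; the reset behaviour below is our assumption of how the counter re-arms, not a verbatim transcription of the patent:

    VOICED_TYPE = {"ONSET", "VOICED", "VOICED TRANSITION"}
    N_TM_FRAMES = 3          # e.g. protect up to two consecutive erasures

    def update_tm_state(state, clas):
        """Return (is_tm_frame, new_state) for the current frame."""
        if clas not in VOICED_TYPE:
            return False, 0                 # assumed: re-arm on unvoiced/inactive
        if 0 <= state < N_TM_FRAMES:
            return True, state + 1          # N_TM_FRAMES > state >= 0: TM frame
        return False, -1                    # voiced run continues: TM disabled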
  • the best solution might be to use the TM coding technique to protect against two or even more consecutive frame erasures; in that case, however, the coding efficiency in clean-channel conditions will drop.
  • the number of the consecutive TM frames might be made adaptive to the conditions of transmission. In the non-restrictive illustrative embodiment of the present invention, up to two TM frames following the transition (voiced onset or transition between two different voiced sounds) frame are considered, which corresponds to a design able to cope with up to two consecutive frame erasures.
  • the compromise between the clean-channel performance and the frame-error robustness can also be based on a closed-loop classification. More specifically, in a frame that one wants to protect against a previous frame erasure, or for which one wants to decide whether it is the onset frame, the two possible coding modes are computed in parallel: the frame is processed both using the generic (CELP) coding mode and the TM coding technique. The performance of the two approaches is then compared using an SNR measure, for example; for more details, see the following Section entitled "TM coding Technique Performance in EV-VBR Codec".
  • when the difference between the SNR for the generic (CELP) coding mode and the SNR for the TM coding technique is greater than a given threshold, the generic (CELP) coding mode is applied. If the difference is smaller than the given threshold, the TM coding technique is applied.
  • the value of the threshold is chosen depending on how strong the frame erasure protection and onset coding determination need to be.
  • the reasons and mechanisms for selecting frames for coding using the TM coding technique have been described. Now it will be shown that it is generally more efficient not to use the glottal-shape codebook in all subframes, in order to achieve the best compromise between the clean-channel performance at a given bit rate and the performance in the presence of an erasure in the frames preceding the TM frames.
  • the glottal-shape codebook search is important only in the first pitch-period in a frame.
  • the following pitch periods can be encoded using the more efficient standard adaptive codebook search since they no longer use the excitation of the past frame (when the adaptive codebook is searched, the excitation is searched up to about one pitch period in the past). There is consequently no reason to employ the glottal-shape codebook search in subframes containing no portion of the first pitch period of a frame.
  • this glottal-shape codebook search is used on the first pitch period of the starting voiced segment.
  • in this case, the adaptive codebook contains a noise-like signal (the previous segment was not voiced) and replacing it with a quantized glottal impulse often increases the coding efficiency.
  • in subsequent pitch periods, the periodic excitation has already built up in the adaptive codebook and using this codebook will yield better results. For this reason, the information on the voiced onset position needs to be available at least with subframe resolution.
  • the following discussion of bit allocation concerns frames with pitch periods longer than the subframe length.
  • because the codebook contains quantized shapes of the glottal impulse, it is best suited for use in subframes containing a glottal impulse. In other subframes, its efficiency is low.
  • given that the bit rate is often quite limited in speech encoding applications and that the encoding of the glottal-shape codebook parameters requires a relatively large number of bits at low bit rates, a bit allocation where the glottal-shape codebook is used and searched in only one subframe per frame was chosen in the non-restrictive illustrative embodiment.
  • to select this subframe, the first glottal impulse in the LP residual signal is searched for.
  • the following simple procedure can be used.
  • the maximum sample of the LP residual signal is searched in the range [0, T_op + 2], where T_op is the open-loop pitch period for the first half-frame and 0 corresponds to the frame beginning.
  • 0 denotes the beginning of the subframe where the onset beginning is located.
  • the glottal-shape codebook will then be employed in the subframe with the maximum residual signal energy.
  • the other subframes (not encoded with the use of the glottal- shape codebook) will be processed as follows. If the subframe using glottal-shape codebook search is not the first subframe in the frame, the excitation signal in preceding subframe(s) of the frame is encoded using the fixed CELP codebook only; this means that the first stage excitation signal is zero. If the glottal-shape codebook subframe is not the last subframe in the frame, the following subframe(s) of the frame is/are processed using standard CELP encoding (i.e. using the adaptive and the fixed codebook search). In Figures 5a-5c, the situation is shown for the case where the first glottal impulse emerges in the 2nd subframe.
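A compact sketch of this selection and of the resulting per-subframe coding modes (our helper following the description above; the maximum-sample rule stands in for the maximum-energy rule):

    import numpy as np

    def select_tm_subframe(residual, t_op, n_sub, n_subframes=4):
        """Find the first glottal impulse and assign subframe coding modes."""
        end = min(int(t_op) + 2, len(residual))       # search range [0, T_op + 2]
        pos = int(np.argmax(np.abs(residual[:end])))  # first impulse position
        tm = pos // n_sub                             # TM subframe index
        modes = (["FCB_ONLY"] * tm + ["GLOTTAL_SHAPE"]
                 + ["CELP"] * (n_subframes - tm - 1))
        return tm, modes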
  • u(n) is the LP residual signal.
  • the first stage excitation signal is denoted q_k'(n) when it is built using the glottal-shape codebook, or v(n) when it is built using the adaptive codebook.
  • in this example, the first stage excitation signal is zero in the 1st subframe, a glottal-shape codevector in the 2nd subframe and an adaptive codebook vector in the last two subframes.
  • once the TM subframe is selected, the subframe containing the 2nd glottal impulse in the LP residual signal is determined. This determination is based on the pitch period value, and one of the following four situations can then occur. In the first situation, the 2nd glottal impulse is in the 1st subframe, and the 2nd, 3rd and 4th subframes are processed using standard CELP encoding (adaptive and fixed codebook search).
  • in the second situation, the 2nd glottal impulse is in the 2nd subframe, and the 2nd, 3rd and 4th subframes are again processed using standard CELP encoding.
  • in the third situation, the 2nd glottal impulse is in the 3rd subframe.
  • in this case, the 2nd subframe is processed using the fixed codebook search only, as there is no glottal impulse in the 2nd subframe of the LP residual signal to be searched for using the adaptive codebook.
  • the 3rd and 4th subframes are processed using standard CELP encoding.
  • in the fourth situation, the 2nd glottal impulse is in the 4th subframe (or in the next frame); the 2nd and 3rd subframes are processed using the fixed codebook search only, and the 4th subframe is processed using standard CELP encoding. A more detailed discussion is provided in an exemplary implementation later below.
  • Table 3 shows names of the possible coding configurations and their occurrence statistics.
  • Table 3 gives the distribution of the first and the second glottal impulse occurrence in each subframe for frames processed with the TM coding technique.
  • Table 3 corresponds to the scenario where the TM coding technique is used to encode only the voiced onset frame and one subsequent frame.
  • the frame length of the speech signal in this experiment was 20 ms, the subframe length 5 ms, and the experiment was conducted using the voices of 32 men and 32 women (unless mentioned otherwise, the same speech database was used in all other experiments mentioned in the following description).
  • Table 3 - Coding mode configurations for TM and their occurrence when speech signal is processed.
  • the glottal-shape codebook consists of quantized normalized shapes of the glottal impulses placed at a specific position. Consequently, the codebook search consists both in the selection of the best shape, and in the determination of its best position in a particular subframe.
  • the shape of the glottal impulse can be represented by a unit impulse and does not need to be quantized; in that case, only its position in the subframe is determined. However, the performance of such a simple codebook is very limited.
  • the quantized shapes have been selected such that the absolute maximum is around the middle of the 17-sample length.
  • this middle is aligned with the index k' which represents the position of the glottal impulse in the current subframe and is chosen from the interval [0, N - 1], N being the subframe length.
  • because the codebook entry length of 17 samples is shorter than the subframe length, the remaining samples are set to zero.
  • the glottal-shape codebook is designed to represent as many existent glottal impulses as possible.
  • a training process based on the k-means algorithm [4] was used; the glottal-shape codebook was trained using more than three (3) hours of speech signal composed of utterances of many different speakers speaking in several different languages. From this database, the glottal impulses have been extracted from the LP residual signal and truncated to 17 samples around the maximum absolute value. From the sixteen (16) shapes selected by the k-means algorithm, the number of shapes has been further reduced to eight (8) shapes experimentally using a segmental SNR quality measure. The selected glottal-shape codebook is shown in Figure 6. Obviously, other means can be used to design the glottal-shape codebook.
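For illustration, a toy version of such a training loop might look as follows (an assumed procedure; the real codebook of Figure 6 was trained on a large multilingual corpus as described above):

    import numpy as np

    def train_glottal_shapes(residual, impulse_positions, n_shapes=8,
                             half=8, iters=20, seed=0):
        """Cluster 17-sample, energy-normalized residual segments (k-means)."""
        rng = np.random.default_rng(seed)
        segs = []
        for p in impulse_positions:
            if half <= p < len(residual) - half:
                s = np.asarray(residual[p - half:p + half + 1], dtype=float)
                segs.append(s / (np.linalg.norm(s) + 1e-12))
        X = np.array(segs)
        shapes = X[rng.choice(len(X), n_shapes, replace=False)]
        for _ in range(iters):
            labels = np.argmax(X @ shapes.T, axis=1)  # nearest = max correlation
            for j in range(n_shapes):
                if np.any(labels == j):
                    m = X[labels == j].mean(axis=0)
                    shapes[j] = m / (np.linalg.norm(m) + 1e-12)
        return shapes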
  • the search can be performed similar to the fixed codebook search in CELP.
  • the codebook entries can be successively placed at all potential positions in the past excitation and the best shape/position combination can be selected in a similar way as is used in the adaptive codebook search.
  • the non-restrictive illustrative embodiment uses the configuration where the codebook search is similar to the fixed codebook search in Algebraic CELP (ACELP).
  • ACELP Algebraic CELP
  • the shape is represented as an impulse response of a shaping filter G(z).
  • the codevectors corresponding to glottal impulse shapes centered at different positions can be represented by codevectors containing only one non-zero element, filtered through the shaping filter G(z) (for a subframe size N there are N single-pulse vectors for the potential glottal impulse positions k').
  • the configuration of the TM part is shown in Figure 7 for the encoder and in Figure 11 for the decoder. As already mentioned, the TM part replaces the adaptive codebook part of the encoder/decoder. During the search, the impulse response of the shaping filter G(z) can be integrated to the impulse response of the filter H(z).
  • a procedure and corresponding codebook searcher for searching the optimum glottal impulse center position k' for a certain shape of the glottal impulse rendered by the shaping filter G(z) will now be described. Because the shape of the filter G(z) is chosen from several candidate shapes (eight (8) shapes are used in the non-restrictive illustrative embodiment, as illustrated in Figure 6), the search procedure must be repeated for each glottal shape of the codebook in order to find the optimum impulse shape and position.
  • the search minimizes the mean-squared error between the target vector x_1 and the glottal-shape codevector centered at position k', filtered through the weighted synthesis filter H(z). Similar to CELP, the search can be performed by finding the maximum of a criterion in the form:

    (x_1^T y_k')^2 / (y_k'^T y_k') = (x_1^T H q_k')^2 / (q_k'^T H^T H q_k') = (x_1^T H G p_k')^2 / (p_k'^T G^T H^T H G p_k')    (16)

where the square root of the numerator is denoted C_k' and the denominator E_k' (see Figures 8 and 9).
  • H is the lower triangular Toeplitz convolution matrix of the weighted synthesis filter.
  • the rows of the matrix Z correspond to the filtered, shifted versions of the glottal impulse shape or its truncated representation. Note that all vectors in this text are supposed to be column vectors (N x 1 matrices).
  • g(n) are the coefficients of the impulse response of the non-causal shaping filter G(z), given for n located within the range [-L_1/2, L_1/2]. Because the position codevector p_k' has only one non-zero element, the computation of the criterion (16) is very simple and can be expressed using the following Equation:

    (d_g(k'))^2 / Φ_g(k', k')    (18)

where d_g = Z^T x_1 is the backward-filtered target vector and Φ_g = Z^T Z is the correlation matrix of the filtered shapes.
  • As can be seen from Equation (18), only the diagonal of the matrix Φ_g needs to be computed.
  • Equation (18) is typically used in the ACELP algebraic codebook search by precomputing the backward-filtered target vector d_g and the correlation matrix Φ_g.
  • however, this cannot be directly applied for the first L_1/2 positions.
  • a more sophisticated search is used where some computed values can still be reused to maintain the complexity at a low level. This will be described hereinafter.
  • let z_k' be the (k'+1)th row of the matrix Z, where the matrix Z (Figure 10) is computed as follows. Given the non-causal nature of the shaping filter G(z), the matrix Z is computed in two stages to minimize the computational complexity. The first L_1/2 + 1 rows of this matrix are computed first. For the remaining part of the matrix Z (the last N - L_1/2 - 1 rows), the criterion (18) is used in a manner similar to the ACELP fixed codebook search.
  • the first L_1/2 + 1 rows of the matrix Z, corresponding to the positions k' within the range [0, L_1/2], are computed first. For these positions, a different truncated glottal shape is used for each position k' within this range.
  • first, a convolution between the glottal-shape response for position k' = 0 and the impulse response h(n) is computed using the Equation:

    z_0(n) = Σ_{i=0..n} g(i) h(n-i),  n = 0, ..., N-1    (20)

with g(i) = 0 for i > L_1/2. For the following rows, the recursion in Equation (21) is used:

    z_k'(n) = z_{k'-1}(n-1) + g(-k') h(n),  k' = 1, ..., L_1/2    (21)

where z_{k'-1}(-1) is taken as 0.
  • the criterion (18) can be computed in a manner similar to that described in the above section Fixed codebook search to further reduce the computational complexity.
  • the energies on the diagonal of Φ_g can then be computed recursively starting from the last position, for example:

    Φ_g(N-2, N-2) = Φ_g(N-1, N-1) + z_{L_1/2}(L_1/2+1) z_{L_1/2}(L_1/2+1)
  • in the second stage, the numerator and the denominator of criterion (18) are calculated for all positions k' > L_1/2.
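Putting Equations (20), (21) and (18) together, a single-shape position search can be sketched as follows (the energies are recomputed plainly rather than via the backward recursion, for clarity; the names are ours):

    import numpy as np

    def glottal_shape_search(x1, g, h, half):
        """Best position k' for one shape by maximizing criterion (18).

        g[i] stores g(i - half), i.e. the non-causal shape of length 2*half+1;
        h is the impulse response of H(z); x1 is the target vector.
        """
        n = len(x1)
        h = np.asarray(h, dtype=float)
        z = np.convolve(g[half:], h)[:n]        # Eq. (20): row for k' = 0
        best_k, best_c = 0, -np.inf
        for k in range(n):
            c = np.dot(x1, z) ** 2 / (np.dot(z, z) + 1e-12)
            if c > best_c:
                best_k, best_c = k, c
            # Eq. (21): next row = shift right, add the next non-causal tap
            tap = g[half - (k + 1)] if k + 1 <= half else 0.0
            z = np.concatenate(([0.0], z[:-1])) + tap * h[:n]
        return best_k, best_c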
  • the above-described procedure makes it possible to find the maximum of the criterion (18) for codevectors representing the first shape of the glottal impulses.
  • the search then continues, using the previously described procedure, for all other glottal impulse shapes; the glottal-shape codebook search finds the single maximum value of criterion (18), corresponding to one glottal shape and one position k', which constitute the result of the search.
  • the criterion (18) is computed for all possible glottal impulse positions k'.
  • alternatively, the search is performed only in a restrained range around the expected position k' to further reduce the computational complexity.
  • this expected position is in the range [k_min, k_max], 0 ≤ k_min < k_max < N, and can be determined for the first glottal shape from the LP residual signal maximum found as described in the above Section Subframe Selection for Glottal-Shape Codebook Search.
  • a glottal-shape codebook search is then performed and position k' is found for the first glottal shape.
  • the new range [k_min, k_max] is set for the second glottal shape search around the position k' found for the first shape:

    k_min = k' - Δ,  k_max = k' + Δ    (30)

where Δ is a small search radius (its exact value is an implementation choice).
  • Equation (30) is likewise used to define the search range for the third shape around the selected position of the second shape, and so on.
• Φ_g(N-8, N-8) = Φ_g(N-1, N-1) + z_L1/2(L1/2 + 8) · z_L1/2(L1/2 + 8).
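As a small illustration of the restricted range, the helper below clamps a window around the expected position to the valid interval [0, N-1]. The half-width delta is an assumption; the text above only requires 0 ≤ k_min < k_max < N.

```c
/* Sketch: build the restricted search range [k_min, k_max] around an
 * expected position k_exp, clamped to the subframe 0..N-1.  The
 * half-width 'delta' is an assumed parameter. */
static void position_search_range(int k_exp, int delta, int N,
                                  int *k_min, int *k_max)
{
    *k_min = k_exp - delta;
    *k_max = k_exp + delta;
    if (*k_min < 0)     *k_min = 0;
    if (*k_max > N - 1) *k_max = N - 1;
}
```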
• the last parameter to be determined in the glottal-shape codebook search is the gain g_p, which can be computed as in Equation (4), with the difference that it is not bounded as in the adaptive codebook search.
  • the reason is that the filtered glottal-shape codevector is constructed using normalized quantized glottal shapes with energy very different from the energy of the actual excitation signal impulses.
  • the subframe can contain more than one glottal impulse (especially in the configuration TRANSITION_1_1). In this case it is necessary to model all the glottal impulses. Given the pitch period length limitations and the subframe length, a subframe cannot contain more than two glottal impulses in this non-restrictive illustrative embodiment.
• the pitch period T0 can be determined, for example, by the standard closed-loop pitch search approach.
• this technique adds the missing glottal impulse at the correct position into the glottal-shape codevector; this is illustrated as the dotted impulse in Figure 12b. This situation appears when the sum of the glottal impulse central position k' and the pitch period T0 is less than the subframe length N, i.e. when k' + T0 < N.
  • the pitch period value is also used to build the fixed codevector when pitch sharpening in the algebraic codebook is used.
• the repetition filter Q(z) is inserted into the TM part of the codec between the filters G(z) and H(z), as shown in the block diagram of Figure 13 for the encoder; the same change is made in the decoder. Similarly to pitch sharpening, the impulse response of the repetition filter Q(z) can be incorporated into the combined impulse response of G(z) and H(z) prior to the codebook search, so that both impulses are taken into account during the search while keeping the complexity of the search low (see the sketch below).
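A minimal sketch of what such a repetition filter does to the codevector is given below, assuming the simple transfer function Q(z) = 1 + b·z^(-T0); the exact transfer function and the gain b are assumptions, since the text only states that Q(z) repeats the glottal impulse one pitch period later.

```c
/* Sketch: apply an assumed repetition filter Q(z) = 1 + b*z^(-T0) to
 * the glottal-shape codevector q[0..N-1], copying the impulse one
 * pitch period later so that a second glottal impulse appears when
 * T0 < N.  Iterating downward keeps the FIR filtering in place. */
static void apply_repetition_filter(float *q, int N, int T0, float b)
{
    if (T0 <= 0 || T0 >= N)
        return; /* second impulse would fall outside this subframe */
    for (int n = N - 1; n >= T0; n--)
        q[n] += b * q[n - T0];
}
```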
  • Another approach to build the glottal-shape codevector with two glottal impulses in one subframe is to use an adaptive codebook search in a part of the subframe.
• the first T0 samples of the glottal-shape codevector q_k(n) are built using the glottal-shape codebook search, and the remaining samples in the subframe are built using the adaptive search, as shown in Figure 14.
  • This approach is more complex, but more accurate.
  • the TM coding technique has been implemented in the EV-VBR codec.
  • the EV-VBR classification procedure has been adapted to select frames to be encoded using the TM coding technique.
• the gain of the glottal-shape codebook contribution is quantized in two steps as depicted in Figure 16, where G(z) is the shaping filter, k' is the position of the centre of the glottal shape and g_m is the TM gain.
• the TM gain g_m is found in the same way as the pitch gain using Equation (4), with the only difference that it is not bounded. It is then quantized by means of a 3-bit scalar quantizer plus one sign bit, and the glottal-shape codevector is scaled using this gain g_m. After both contributions to the filtered excitation signal (the first and second stage contribution signals) are found, the gain of the first stage excitation signal is further adjusted jointly with the quantization of the second stage excitation signal gain, using the standard EV-VBR gain vector quantization (VQ); the first, scalar step is sketched below.
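A minimal sketch of the first quantization step follows, under the assumption of a nearest-neighbour 3-bit magnitude table plus a sign bit; the level values are placeholders, as the actual EV-VBR tables are not given in this text.

```c
#include <math.h>

/* Sketch: first-step TM gain quantization, a 3-bit scalar quantizer
 * for the magnitude plus one sign bit.  The level table is a
 * placeholder, not the actual EV-VBR table. */
static const float gm_levels[8] = {
    0.25f, 0.5f, 0.75f, 1.0f, 1.5f, 2.0f, 3.0f, 4.5f /* assumed */
};

static int quantize_tm_gain(float g_m, int *sign_bit)
{
    float mag = fabsf(g_m);
    int best = 0;
    float best_err = fabsf(mag - gm_levels[0]);

    *sign_bit = (g_m < 0.0f);
    for (int i = 1; i < 8; i++) {
        float err = fabsf(mag - gm_levels[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return best; /* 3-bit index */
}
```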
  • the gain quantization codebooks of EV-VBR designed for generic or voiced coding modes could be used also in TM coding.
  • the transmitted parameters related to the TM coding technique are listed in Table 4 with the corresponding number of bits.
• the parameter T0, which is used to determine the filter Q(z) or to perform the adaptive search for the second glottal impulse in the case of two impulses in one subframe, is transmitted when T0 < N.
  • the remaining parameters used for a TM frame, but common with the generic ACELP processing, are not shown here (frame identification bits, LP parameters, pitch delay for adaptive excitation, fixed codebook excitation, 1st and 2nd stage codebook gains).
• when TM parameters are added to the bit stream, the number of bits originally allocated to other EV-VBR parameters is reduced in order to maintain a constant bit rate. These bits can be taken, for example, from the fixed codebook excitation bits as well as from the gain quantization.
• in Figure 17, an example of the impact of the TM coding technique is shown for the clean-channel condition.
  • Figure 17a shows the input speech signal
  • Figure 17b shows the LP residual signal
  • Figure 17c shows the first stage excitation signal where the TM coding technique is used in the first three (3) frames.
• the difference between the residual signal and the first stage excitation signal is more pronounced at the beginning of each frame.
• later in the frame, the first stage excitation signal corresponds more closely to the residual signal because the standard adaptive codebook search is used.
  • Tables 5 and 6 summarize some examples of the performance of the TM coding technique measured using SNR values.
• Table 6 summarizes an example of the performance of the EV-VBR codec for a WB input speech signal and a glottal-shape codebook with eight (8) shapes of length seventeen (17) samples.
• in the WB case, the SNR values show some degradation for the clean channel when the TM coding technique is used, even if it is used in one frame only. This is caused mostly by the limited length of the glottal-shape impulses: in comparison to the NB example, more zero values are present in the first stage excitation signal in the subframe.
• the benefit of using the TM coding technique in this example lies in the FE (Frame Erasure) protection.
• Table 7 summarizes the computational complexity of the TM coding technique.
  • the TM coding technique increases the complexity in the encoder by 1.8 WMOPS (Weighted Millions of Operations Per Second).
  • the complexity in the decoder remains approximately the same.
  • Table 7 Complexity of the TM coding technique (worst case and average values).
  • the following figures illustrate the performance of the TM coding technique for voiced onset frame modeling ( Figures 18a-18c) and for frame error propagation mitigation ( Figures 19a-19c).
  • the TM coding technique is used only in one frame at a time in this example.
• a segment of the input speech signal (Figures 18a and 19a), the corresponding output synthesized speech signal produced by the EV-VBR decoder without the TM coding technique (Figures 18b and 19b), and the output synthesized speech signal produced by the EV-VBR decoder with the TM coding technique (Figures 18c and 19c) are shown.
  • the benefits of the TM coding technique can be observed both in the modeling of the voiced onset frame (2nd frame of Figure 18) and in the limitation of frame error propagation (4th and 5th frames of Figure 19).
• the frame erasure concealment technique used in the EV-VBR decoder is based on the use of an extra decoder delay of 20 ms (corresponding to one frame length). This means that if a frame is missing, it is concealed with the knowledge of the future frame parameters. Let us suppose three (3) consecutive frames denoted m-1, m and m+1, and further suppose a situation where the frame m is missing.
• an interpolation between the last correctly received frame m-1 and the following correctly received frame m+1 can then be computed to determine the codec parameters, including in particular but not exclusively the LP filter coefficients (represented by ISFs, Immittance Spectral Frequencies), the closed-loop pitch period T0, and the pitch and fixed codebook gains.
• the interpolation helps to estimate the lost frame parameters more accurately for stable voiced segments. However, it often fails for transition segments, where the codec parameters vary rapidly. To cope with this problem, the absolute value of the pitch period can be transmitted in every TM frame, even when it is not used for the first stage excitation construction in the current frame m+1; this holds especially for configurations TRANSITION_1_4 and TRANSITION_4. A simple interpolation sketch follows below.
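The sketch below illustrates the interpolation idea in its simplest form, reconstructing the parameters of the lost frame m from frames m-1 and m+1; the equal weighting is an assumption, and the actual EV-VBR concealment rules are more elaborate.

```c
/* Sketch: conceal the parameter set of a lost frame m from the last
 * good frame m-1 and the future frame m+1 (available thanks to the
 * 20 ms extra decoder delay).  Equal-weight interpolation is an
 * assumption; the codec's per-parameter rules are more elaborate. */
static void conceal_lost_frame(const float *par_prev, /* frame m-1 */
                               const float *par_next, /* frame m+1 */
                               float *par_lost,       /* frame m   */
                               int n_params)
{
    for (int i = 0; i < n_params; i++)
        par_lost[i] = 0.5f * (par_prev[i] + par_next[i]);
}
```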
• other parameters transmitted in a TM frame are the ISFs of the preceding frame.
• the ISF parameters are generally interpolated between the previous frame ISFs and the current frame ISFs for each subframe. This ensures a smooth evolution of the LP synthesis filter from one subframe to another.
• the ISFs of the frame preceding the frame erasure are usually used for the interpolation in the frame following the erasure, instead of the erased frame ISFs.
• in transitions, the ISFs vary rapidly, and the last-good-frame ISFs might be very different from the ISFs of the missing, erased frame.
• replacing the missing frame ISFs by the ISFs of the previous frame may thus cause significant artefacts. If the past frame ISFs can be transmitted, they can be used for the ISF interpolation in the TM frame in case the previous frame is erased (see the sketch below). Later, different estimations of the LP coefficients used for the ISF interpolation when the frame preceding a TM frame is missing will be described.
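A minimal sketch of the per-subframe ISF interpolation follows, with linear weights assumed; when the previous frame is erased, isf_prev can be the previous-frame ISFs re-transmitted in the current TM frame rather than a concealed estimate.

```c
#define M_LP 16 /* LP order, assumed */

/* Sketch: per-subframe interpolation between previous- and
 * current-frame ISFs.  The linear weights w = (s+1)/n_subfr are an
 * assumption; the codec's actual interpolation weights may differ. */
static void interpolate_isf(const float isf_prev[M_LP],
                            const float isf_curr[M_LP],
                            int s, int n_subfr,
                            float isf_out[M_LP])
{
    float w = (float)(s + 1) / (float)n_subfr; /* s = 0..n_subfr-1 */
    for (int i = 0; i < M_LP; i++)
        isf_out[i] = (1.0f - w) * isf_prev[i] + w * isf_curr[i];
}
```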
• the EV-VBR codec supposes that only one frame after an onset/transition frame is coded using TM. In this manner, about 6.3% of active speech frames are selected for TM encoding and decoding.
  • Results for the EV-VBR codec with bit rate of 8 kbps are summarized in Table 8.
• in the WB case, 28% of active speech frames were classified for encoding using the TM coding technique, and an increase of 0.203 dB in segmental SNR was achieved.
• in the NB case, 25% of active speech frames were classified for encoding using the TM coding technique, and an increase of 0.300 dB in segmental SNR was achieved.
• this objective improvement was not confirmed by subjective listening tests, which reported no preference between the codec with and without the TM coding technique.
  • the TM coding technique was implemented in an EV-VBR codec candidate for ITU-T standardization.
  • Table 9 shows bit allocation tables of the original generic mode and all TM coding mode configurations that were introduced herein above. These configurations are used in the EV-VBR codec.
• Table 9 Bit allocation tables for the generic coding mode and for all TM configurations as used in the EV-VBR codec (ID stands for configuration identification, ISFs for Immittance Spectral Frequencies, FCB for Fixed CodeBook, and subfr. for subframe).
• (Table 9: per-subframe bit allocations, comprising the configuration ID, ISF bits, 12- and 20-bit FCBs, and gain bits for each subframe, for the generic mode and each TM configuration.)
• this bit-allocation table can be used only when the TM coding technique is applied solely to the frames following the voiced onset frame (the voiced onset frame is encoded using the generic coding mode and only one frame following the voiced onset frame is encoded using the TM coding technique).
• in this situation, the pitch period T0 satisfies T0 > N in the second subframe, and there is no need to transmit this parameter in the 2nd subframe.
• if the TM coding technique is used also in the voiced onset frame, the following situation may occur.
• the pitch period is shorter than N, but the voiced onset starts only in the 2nd subframe (e.g., the first subframe still contains an unvoiced signal).
• in this case, the pitch period T0 must be transmitted.
• the parameter T0 is then transmitted in the 2nd subframe using five (5) bits, and in one subframe a shorter fixed codebook is used (see Table 10).
• the pitch period is transmitted here anyway in the present, non-limitative implementation (whether the onset frame is coded using the TM coding technique or not), because there is no good use of the saved bits for encoding another parameter.
• different bit allocations can be used in different transition mode configurations. For instance, more bits can be allocated to the fixed codebooks in the subframes containing glottal impulses. For example, in TRANSITION_3 mode, an FCB with twelve (12) bits can be used in the second subframe and twenty-eight (28) bits in the third subframe. Of course, other than 12- and 20-bit FCBs can be used in different coder implementations.
  • Table 10 Bit allocation table for configuration TRANSITION_2 if TM is used also in the onset frame.
• (Table 10 allocates, among other parameters, a 20-bit FCB to the 1st subframe and a 12-bit FCB to the 4th subframe.)
• the VMR-WB codec is an example of a codec that uses some portion of its bits for FE protection. For example, fourteen (14) protection bits per frame are used in the Generic Full-Rate encoding type in VMR-WB Rate-Set II. These bits represent the frame classification (2 bits), the synthesized speech energy (6 bits) and the glottal pulse position (6 bits). The glottal pulse is inserted artificially in the decoder when a voiced onset frame is lost.
• FER protection bits are less important for the excitation construction in a TM frame because the TM coding technique does not make use of the past excitation signal; it constructs the excitation signal using parameters transmitted in the current (TM) frame.
• these bits can however be employed for the transmission of other parameters. In an example of implementation, they can be used to transmit in the current TM frame the ISF parameters of the previous frame (however, only twelve (12) bits instead of thirty-six (36) bits are available). These ISFs are used for a more precise reconstruction of the LP filter coefficients in case of frame erasure.
  • the set of LP parameters is computed centered on the fourth subframe, whereas the first, second, and third subframes use a linear interpolation of the LP filter parameters between the current and the previous frame.
• the interpolation is performed on the ISPs (Immittance Spectral Pairs).
  • This interpolation is however not directly suited for the TM coding technique in the case of erasure of the previous frame.
• if the frame preceding the TM frame is missing, it can be supposed that the last correctly received frame is unvoiced. In this situation it is more efficient to reconstruct the ISF vector for the missing frame with different interpolation constants, whether or not some ISF information from FER protection bits is available.
• the interpolation then weights the previous frame ISPs more heavily.
• the ISP vectors for the missing frame m can be given at the decoder, for example, by using Equations (35).
• the following correctly received TM frame m+1 then uses the LP coefficient interpolation described by Equations (35). The interpolation coefficients in Equations (36) are likewise given as a non-limitative example; the final coefficients could be different. Additionally, it is desirable to use one set of interpolation coefficients when some ISF information from the previous frame is available, and another set when this information is not available (i.e., when there are no frame erasure protection bits in the bit stream).

Pitch Period and Gain Encoding in TM Frames in EV-VBR Codec
  • the value of the pitch period T 0 is transmitted for every subframe in the generic encoding mode used in the EV-VBR codec.
• in the first and third subframes, an 8-bit encoding is used, and the pitch period value is transmitted with fractional resolution (half-sample steps, for T0 in the range [T_min, 91½]) or integer resolution (for T0 in the range [92, T_max]).
• in the second and fourth subframes, a delta search is used, and the pitch period value, always with fractional resolution, is coded with five (5) bits.
• delta search means a search within the range [T_op - 8, T_op + 7½], where T_op is the nearest integer to the fractional pitch period of the previous (1st or 3rd) subframe; a small encode/decode sketch follows below.
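The range [T_op - 8, T_op + 7½] in half-sample steps contains exactly 32 values, which matches the five bits. A minimal encode/decode sketch, with the rounding convention assumed:

```c
/* Sketch: 5-bit delta coding of a fractional pitch period within
 * [T_op - 8, T_op + 7.5] in half-sample steps (32 codewords).
 * T_op is the nearest integer to the previous subframe's fractional
 * pitch period; the rounding convention is an assumption. */
static int encode_delta_pitch(float T0, int T_op)
{
    int idx = (int)((T0 - ((float)T_op - 8.0f)) * 2.0f + 0.5f);
    if (idx < 0)  idx = 0;
    if (idx > 31) idx = 31;
    return idx;
}

static float decode_delta_pitch(int idx, int T_op)
{
    return ((float)T_op - 8.0f) + 0.5f * (float)idx;
}
```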
  • the pitch gain g p and the fixed codebook gain g c are encoded in the EV-VBR codec in principle in the same manner as in the AMR-WB+ codec [5].
  • the pitch gain g p and the fixed codebook gain g c are vector quantized and coded in one step using five (5) bits for every subframe.
  • the estimated fixed codebook energy is computed and quantized as follows.
  • the LP residual energy is computed in each subframe k using the following Equation:
  • the fixed codebook energy is estimated from the residual energy by removing an estimate of the adaptive codebook contribution. This is done by removing an energy related to the average normalized correlation obtained from the two open-loop pitch analyses performed in the frame.
• the following Equation is used, where R̄ is the average of the normalized pitch correlations obtained from the open-loop pitch analysis for each half-frame of the current frame.
• the estimated scaled fixed codebook energy does not depend on the previous frame energy, and the gain encoding principle is thus robust to frame erasures.
• next, the pitch gain and the fixed codebook gain correction are computed: the estimated scaled fixed codebook energy is used to calculate the estimated fixed codebook gain and the correction factor γ (the ratio between the true and the estimated fixed codebook gains).
• the value γ is vector quantized together with the pitch gain using five (5) bits per subframe, as sketched below.
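A minimal sketch of this joint quantization follows, under assumptions: a 32-entry codebook of (pitch gain, correction factor) pairs and an unweighted squared-error criterion; the real search minimizes the error in the synthesis domain and uses the actual EV-VBR codebook.

```c
/* Sketch: joint 5-bit vector quantization of the pitch gain g_p and
 * the fixed-codebook correction factor gamma = g_c / g_c_estimated.
 * The codebook 'gain_cb' (32 pairs) and the plain squared-error
 * criterion are assumptions. */
static int quantize_gains(float g_p, float gamma,
                          const float gain_cb[32][2])
{
    int best = 0;
    float best_err = 1e30f;
    for (int i = 0; i < 32; i++) {
        float dp = g_p   - gain_cb[i][0];
        float dg = gamma - gain_cb[i][1];
        float err = dp * dp + dg * dg;
        if (err < best_err) { best_err = err; best = i; }
    }
    return best; /* 5-bit index */
}
```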
• a modified k-means method [4] is used for the design of the quantizer.
• the pitch gain is restricted to the interval ⟨0; 1.2⟩ during the codebook initialization and to ⟨0; ∞⟩ during the iterative codebook improvement.
• the correction factor γ is limited to ⟨0; 5⟩ during initialization and to ⟨0; ∞⟩ during the codebook improvement.
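These interval restrictions are the "modified" part of the k-means design; a minimal sketch of the corresponding centroid clamping step follows, with everything else in the k-means iteration standard and omitted.

```c
/* Sketch: clamp a (pitch gain, correction factor) centroid after a
 * k-means update.  During initialization the components are limited
 * to <0; 1.2> and <0; 5>; during the iterative improvement only the
 * lower bound 0 is kept (the upper bounds are removed). */
static void clamp_centroid(float c[2], int init_phase)
{
    if (c[0] < 0.0f) c[0] = 0.0f; /* pitch gain        */
    if (c[1] < 0.0f) c[1] = 0.0f; /* correction factor */
    if (init_phase) {
        if (c[0] > 1.2f) c[0] = 1.2f;
        if (c[1] > 5.0f) c[1] = 5.0f;
    }
}
```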
  • the modified /c-means algorithm seeks to minimize the following criterion:
• Configuration TRANSITION_1_1 (Figure 20) -
• this configuration uses the procedure described above in the first subframe.
  • This configuration is used in the EV-VBR codec also when only one glottal impulse appears in the first subframe.
• the pitch period T0 satisfies T0 < N, and it is used for periodicity enhancement [1] in the fixed codebook search.
  • Configuration TRANSITION_1_2 ( Figure 21) -
  • the first subframe is processed using the glottal-shape codebook search.
  • the pitch period is not needed and all following subframes are processed using the adaptive codebook search.
• the pitch period maximum value satisfies T0 ≤ 2N - 1. This maximum value can be further reduced thanks to the knowledge of the glottal impulse position k'.
• the pitch period value in the second subframe is then coded using seven (7) bits with fractional resolution over the whole range.
• in the third and fourth subframes, a delta search using five (5) bits with fractional resolution is used.
  • Configuration TRANSITION_1_3 ( Figure 22) -
  • the first subframe is processed using the glottal-shape codebook search again with no use of the pitch period. Because the second subframe of the LP residual signal contains no glottal impulse and the adaptive search is useless, the first stage excitation signal is replaced by zeros in the second subframe.
• the adaptive codebook parameters (T0 and g_p) are not transmitted in the second subframe, and the saved bits are used to increase the FCB size in the third subframe. Because the second subframe carries a minimum of useful information, only the 12-bit FCB is used there, while the 20-bit FCB is used in the fourth subframe.
• the first stage excitation signal in the third subframe is constructed using the adaptive codebook search with the pitch period maximum value (3N - 1 - k') and minimum value (2N - k'); thus only a 7-bit encoding of the pitch period with fractional resolution over the whole range is used.
• the fourth subframe is processed using the adaptive search again, with a 5-bit delta-search encoding of the pitch period value.
  • Configuration TRANSITION_1_4 ( Figure 23) -
• the first subframe is processed using the glottal-shape codebook search. Again, the pitch period does not need to be transmitted. But because the LP residual signal contains no glottal impulse in the second and third subframes, the adaptive codebook search is useless for these two subframes. Again, the first stage excitation signal in these subframes is replaced by zeros, and the saved bits are used to increase the FCB size so that all subframes can use the 20-bit FCBs.
• the pitch period value is transmitted only in the fourth subframe, and its minimum value is (3N - k'). The maximum value of the pitch period is limited by T_max. It does not matter whether the second glottal impulse appears in the fourth subframe or not (the second glottal impulse can appear in the next frame when k' + T0 exceeds the frame length).
  • the absolute value of the pitch period is used at the decoder for the frame concealment; therefore this absolute value of the pitch period is transmitted in the situation when the second glottal impulse appears in the next frame.
  • the correct knowledge of the pitch period value from the frames m-1 and m+1 helps to reconstruct the missing part of the synthesis signal in the frame m successfully.
  • Configuration TRANSITION_2 ( Figure 24) -
  • the pitch period is transmitted only in the third and fourth subframes. In this case, only fixed codebook parameters are transmitted in the first subframe.
• the frame shown in Figure 24 supposes the configuration where TM is not used in voiced onset frames. If TM is used also in the voiced onset frames, the configuration TRANSITION_2a is used, in which the pitch period T0 is transmitted in the second subframe so that the procedure described above can be used.
• the pitch period for the third subframe is still transmitted in the bit stream even when the TM coding technique is not used to encode the voiced onset frames; this value is useful only when voiced onset frames are encoded using the TM coding technique.


Abstract

There is provided a transition mode device and method for use in a predictive-type sound signal codec for producing a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in the sound signal, comprising an input for receiving a codebook index and a transition mode codebook for generating a set of codevectors independent from past excitation. The transition mode codebook is responsive to the index for generating, in the transition frame and/or frame following the transition, one of the codevectors of the set corresponding to the transition mode excitation. There is also provided an encoding device and method and a decoding device and method using the above described transition mode device and method.

Description

TITLE OF THE INVENTION
METHOD AND DEVICE FOR CODING TRANSITION FRAMES IN SPEECH SIGNALS
FIELD OF THE INVENTION
[0001] The present invention relates to a technique for digitally encoding a sound signal, for example a speech or audio signal, in view of transmitting and synthesizing this sound signal.
[0002] More specifically, but not exclusively, the present invention relates to a method and device for encoding transition frames and frames following the transition in a sound signal, for example a speech or audio signal, in order to reduce the error propagation at the decoder in case of frame erasure and/or to enhance coding efficiency mainly at the beginning of voiced segments (onset frames). In particular, the method and device replace the adaptive codebook typically used in predictive encoders by a codebook of, for example, glottal impulse shapes in transition frames and in frames following the transition. The glottal-shape codebook can be a fixed codebook independent of the past excitation whereby, once the frame erasure is over, the encoder and the decoder use the same excitation so that convergence to clean-channel synthesis is quite rapid. In onset frame coding in traditional CELP, the past excitation buffer is updated using the noise-like excitation of the previous unvoiced or inactive frame that is very different from the current excitation. On the other hand, the proposed technique can build the periodic part of the excitation very accurately.
BACKGROUND
[0003] A speech encoder converts a speech signal into a digital bit stream which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is sampled and quantized with usually 16-bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a speech signal.
[0004] Code-Excited Linear Prediction (CELP) coding is one of the best prior art techniques for achieving a good compromise between subjective quality and bit rate. This coding technique forms the basis of several speech coding standards both in wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of M samples usually called frames, where M is a predetermined number corresponding typically to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, a 5-15 ms speech segment from the subsequent frame. The M-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
[0005] CELP-type speech codecs rely heavily on prediction to achieve their high performance. The prediction used can be of different kinds but usually comprises the use of an adaptive codebook containing an excitation signal selected in past frames. A CELP encoder exploits the quasi periodicity of voiced speech signal by searching in the past excitation the segment most similar to the segment being currently encoded. The same past excitation signal is maintained also in the decoder. It is then sufficient for the encoder to send a delay parameter and a gain for the decoder to reconstruct the same excitation signal as is used in the encoder. The evolution (difference) between the previous speech segment and the currently encoded speech segment is further modeled using an innovation selected from a fixed codebook. The CELP technology will be described in more detail herein below.
[0006] A problem of strong prediction inherent in CELP-based speech coders appears in the presence of transmission errors (erased frames or packets), when the states of the encoder and the decoder become desynchronized. Due to the prediction, the effect of an erased frame is thus not limited to the erased frame, but continues to propagate after the erasure, often during several following frames. Naturally, the perceptual impact can be very annoying.
[0007] Transitions from unvoiced speech segment to voiced speech segment (e.g. transition between a consonant or a period of inactive speech, and a vowel) or transitions between two different voiced segments (e.g. transitions between two vowels) are the most problematic cases for frame erasure concealment. When a transition from unvoiced speech segment to voiced speech segment (voiced onset) is lost, the frame right before the voiced onset frame is unvoiced or inactive and thus no meaningful periodic excitation is found in the buffer of the past excitation (adaptive codebook). At the encoder, the past periodic excitation builds up in the adaptive codebook during the onset frame, and the following voiced frame is encoded using this past periodic excitation. Most frame error concealment techniques use the information from the last correctly received frame to conceal the missing frame. When the onset frame is lost, the decoder past excitation buffer will be thus updated using the noise-like excitation of the previous frame (unvoiced or inactive frame). The periodic part of the excitation is thus completely missing in the adaptive codebook at the decoder after a lost voiced onset and it can take up to several frames for the decoder to recover from this loss.
[0008] A similar situation occurs in the case of lost voiced to voiced transition. In that case, the excitation stored in the adaptive codebook before the transition frame has typically very different characteristics from the excitation stored in the adaptive codebook after the transition. Again, as the decoder usually conceals the lost frame with the use of the past frame information, the state of the encoder and the decoder will be very different, and the synthesized signal can suffer from important distortion.
OBJECTS OF THE INVENTION
[0009] An object of the present invention is therefore to provide a method and device for encoding transition frames in a predictive speech and/or audio encoder in order to improve the encoder robustness against lost frames and/or improve the coding efficiency.
[0010] Another object of the present invention is to eliminate error propagation and increase coding efficiency in CELP-based codecs by replacing the inter-frame dependent adaptive codebook search by a non-predictive, for example glottal-shape, codebook search. This technique requires no extra delay, negligible additional complexity, and no increase in bit rate compared to traditional CELP encoding.
SUMMARY OF THE INVENTION
[0011] More specifically, in accordance with one aspect of the present invention, there is provided a transition mode method for use in a predictive-type sound signal codec for producing a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in the sound signal, comprising: providing a transition mode codebook for generating a set of codevectors independent from past excitation; supplying a codebook index to the transition mode codebook; and generating, by means of the transition mode codebook and in response to the codebook index, one of the codevectors of the set corresponding to the transition mode excitation. [0012] According to a second aspect of the present invention, there is provided a transition mode device for use in a predictive-type sound signal codec for producing a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in the sound signal, comprising an input for receiving a codebook index and a transition mode codebook for generating a set of codevectors independent from past excitation. The transition mode codebook is responsive to the index for generating, in the transition frame and/or frame following the transition, one of the codevectors of the set corresponding to said transition mode excitation.
[0013] According to a third aspect of the present invention, there is provided an encoding method for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: generating a codebook search target signal; providing a transition mode codebook for generating a set of codevectors independent from past excitation, the codevectors of the set each corresponding to a respective transition mode excitation; searching the transition mode codebook for finding the codevector of the set corresponding to a transition mode excitation optimally corresponding to the codebook search target signal.
[0014] According to a fourth aspect of the present invention, there is provided an encoder device for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: a generator of a codebook search target signal; a transition mode codebook for generating a set of codevectors independent from past excitation, the codevectors of the set each corresponding to a respective transition mode excitation; and a searcher of the transition mode codebook for finding the codevector of the set corresponding to a transition mode excitation optimally corresponding to the codebook search target signal. [0015] According to a fifth aspect of the present invention, there is provided a decoding method for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: receiving a codebook index; supplying the codebook index to a transition mode codebook for generating a set of codevectors independent from past excitation; and generating, by means of the transition mode codebook and in response to the codebook index, one of the codevectors of the set corresponding to the transition mode excitation.
[0016] According to a sixth aspect of the present invention, there is provided a decoder device for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising an input for receiving a codebook index and a transition mode codebook for generating a set of codevectors independent from past excitation. The transition mode codebook is responsive to the index for generating in the transition frame and/or frame following the transition one of the codevectors of the set corresponding to the transition mode excitation.
[0017] The foregoing and other objects, advantages and features of the present invention will become more apparent upon reading of the following non- restrictive description of an illustrative embodiment thereof, given by way of example only with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] In the appended drawings:
[0019] Figure 1a is a schematic block diagram of a CELP-based encoder; [0020] Figure 1b is a schematic block diagram of a CELP-based decoder;
[0021] Figure 2 is a schematic block diagram of a frame classification state machine for erasure concealment;
[0022] Figure 3 is an example of segment of a speech signal with one voiced transition frame and one onset frame;
[0023] Figure 4 is a functional block diagram illustrating a classification rule to select TM (Transition Mode) frames in speech onsets, where N_TM_FRAMES stands for a number of consecutive frames to prevent using a TM coding technique, 'clas' stands for a frame class, and VOICED_TYPE means ONSET, VOICED and VOICED TRANSITION classes;
[0024] Figure 5a is a schematic illustration of an example of frame of a speech signal divided into four (4) subframes, showing the speech signal in the time domain;
[0025] Figure 5b is a schematic illustration of an example of frame of a speech signal divided into four (4) subframes, showing a LP residual signal;
[0026] Figure 5c is a schematic illustration of an example of frame of a speech signal divided into four (4) subframes, showing a first stage excitation signal constructed using the TM coding technique in the encoder;
[0027] Figure 6 shows graphs illustrating eight glottal impulses of 17-sample length used for the glottal-shape codebook construction, wherein the x-axis denotes a discrete time index and the y-axis an amplitude of the impulse; [0028] Figure 7 is a schematic block diagram of an example of the TM portion of a CELP encoder, where k' represents a glottal-shape codebook index and G(z) is a shaping filter;
[0029] Figure 8 is a graphical representation of the computation of C_k', the square root of the numerator in the criterion of Equation (16), wherein shaded portions of the vector/matrix are non-zero;
[0030] Figure 9 is a graphical representation of the computation of E_k', the denominator of the criterion of Equation (16), wherein shaded portions of the vector/matrix are non-zero;
[0031] Figure 10 is a graphical representation of the computation of the convolution matrix Zᵀ; in this example the shaping filter G(z) has only three (3) non-zero coefficients (L1/2 = 1);
[0032] Figure 11 is a schematic block diagram of an example of TM portion of a CELP decoder;
[0033] Figure 12a is a schematic block diagram of an example of the structure of the filter Q(z);
[0034] Figure 12b is a graph of an example of glottal-shape codevector modification, wherein the repeated impulse is dotted;
[0035] Figure 13 is a schematic block diagram of the TM portion of a CELP encoder including the filter Q(z);
[0036] Figure 14 is a graph illustrating a glottal-shape codevector with two-impulses construction when an adaptive codebook search is used in a part of the subframe with a glottal-shape codebook search; [0037] Figure 15 is a graph illustrating a glottal-shape codevector construction in the case where the second glottal impulse appears in the first L1/2 positions of the next subframe;
[0038] Figure 16 is a schematic block diagram of the TM portion of an encoder used in an EV-VBR (Embedded Variable Bit Rate) codec implementation;
[0039] Figure 17a is a graph showing an example of speech signal in the time domain;
[0040] Figure 17b is a graph showing a LP residual signal corresponding to the speech signal of Figure 17a;
[0041] Figure 17c is a graph showing a first-stage excitation signal in error-free conditions;
[0042] Figures 18a-18c are graphs illustrating an example of onset construction comparison, wherein the graph of Figure 18a represents the input speech signal, the graph of Figure 18b represents the output synthesized speech of an EV-VBR codec without the TM coding technique, and the graph of Figure 18c represents the output synthesized speech of an EV-VBR codec with the TM coding technique;
[0043] Figures 19a-19c are graphs illustrating an example of the effect of the TM coding technique in the case of frame erasure, wherein the graph of Figure 19a represents the input speech signal, the graph of Figure 19b represents the output synthesized speech of an EV-VBR codec without the TM coding technique, and the graph of Figure 19c represents the output synthesized speech of an EV-VBR codec with the TM coding technique; [0044] Figure 20 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_1;
[0045] Figure 21 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_2;
[0046] Figure 22 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_3;
[0047] Figure 23 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_1_4;
[0048] Figure 24 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_2;
[0049] Figure 25 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_3;
[0050] Figure 26 is a graph illustrating an example of the first-stage excitation signal in one frame of the configuration TRANSITION_4; and
[0051] Figure 27 is a schematic block diagram of a speech communication system illustrating the use of speech encoding and decoding devices.
DETAILED DESCRIPTION
[0052] The non-restrictive illustrative embodiment of the present invention is concerned with a method and device whose purpose is to overcome error propagation in the above described situations and increase the coding efficiency.
[0053] More specifically, the method and device according to the non- restrictive illustrative embodiment of the present invention implement a special encoding, called transition mode (TM) encoding technique, of transition frames and frames following the transition in a sound signal, for example a speech or audio signal. The TM coding technique replaces the adaptive codebook of the CELP codec by a new codebook of glottal impulse shapes, hereinafter designated as glottal-shape codebook, in transition frames and in frames following the transition. The glottal-shape codebook is a fixed codebook independent of the past excitation. Consequently, once a frame erasure is over, the encoder and the decoder use the same excitation whereby convergence to clean-channel synthesis is quite rapid.
[0054] While the use of the TM coding technique in frames following a transition helps to prevent error propagation in the case the transition frame is lost, another purpose of using the TM coding technique also in the transition frame is to improve the coding efficiency. For example, just before a voiced onset, the adaptive codebook usually contains a noise-like signal not very efficient for encoding the beginning of a voiced segment. The idea behind the TM coding technique is thus to supplement the adaptive codebook with a better codebook populated with simplified quantized versions of glottal impulses to encode the voiced onsets.
[0055] The proposed TM coding technique can be used in any CELP-type codec or predictive codec. As an example, the TM coding technique is implemented in a candidate codec in the ITU-T standardization activity for an Embedded Variable Bit Rate Codec, which will be referred to in the remainder of the text as the EV-VBR codec. Although the non-restrictive illustrative embodiment of the present invention will be described in connection with the EV-VBR codec framework, it should be kept in mind that the principles and concepts of the present invention are not limited to an application to the EV-VBR codec but can be applied to any other codec using predictive coding. Also, although the non-restrictive illustrative embodiment of the present invention will be described in connection with a speech signal, it should be kept in mind that the present invention is not limited to an application to speech signals but its principles and concepts can be applied to any other types of sound signals including audio signals.
[0056] A speech frame can be roughly classified into one of the four (4) following speech classes (this will be explained in more detail in the following description):
Inactive frames characterized by the absence of speech activity;
Unvoiced speech frames characterized by an aperiodic structure and energy concentration toward higher frequencies;
- Voiced speech frames having a clear quasi-periodic nature with energy concentrated mainly in low frequencies; and
- Any other frame classified as a transition having rapidly varying characteristics.
[0057] In the EV-VBR codec, a specialized coding mode has been designed for each of the classes. It can be generally stated that the inactive frames are processed through comfort noise generation, the unvoiced speech frames through an optimized unvoiced encoding mode, the voiced speech frames through an optimized voiced encoding mode and all other frames are processed with a generic Algebraic CELP (ACELP) technology. In the EV-VBR codec framework, the TM coding technique is thus introduced as yet another encoding mode in the EV- VBR encoding scheme to encode transition frames and frames following the transition. [0058] Figure 27 is a schematic block diagram of a speech communication system depicting the use of speech encoding and decoding. The speech communication system supports transmission and reproduction of a speech signal across a communication channel 905. Although it may comprise, for example, a wire, optical or fiber link, the communication channel 905 typically comprises at least in part a radio frequency link. The radio frequency link often supports multiple, simultaneous speech communications requiring shared bandwidth resources such as may be found with cellular telephony. Although not shown, the communication channel 905 may be replaced by a storage device in a single device embodiment of the communication system that records and stores the encoded speech signal for later playback.
[0059] Still referring to Figure 27, a microphone 901 produces an analog speech signal that is supplied to an analog-to-digital (A/D) converter 902 for converting it into a digital form. A speech encoder 903 encodes the digital speech signal thereby producing a set of encoding parameters that are coded into a binary form and delivered to a channel encoder 904. The optional channel encoder adds redundancy to the binary representation of the coding parameters before transmitting them over the communication channel 905. On the receiver side, a channel decoder 906 utilizes the above mentioned redundant information in the received bit stream to detect and correct channel errors that have occurred in the transmission. A speech decoder 907 converts the bit stream received from the channel decoder 906 back to a set of encoding parameters for creating a synthesized digital speech signal. The synthesized digital speech signal reconstructed in the speech decoder 907 is converted to an analog form in a digital- to-analog (D/A) converter 908 and played back in a loudspeaker unit 909.
Short Background on CELP
[0060] A speech codec consists of two basic parts: an encoder and a decoder. The encoder digitizes the audio signal, chooses a limited number of encoding parameters representing the speech signal and converts these parameters into a digital bit stream that is transmitted to the decoder through a communication channel. The decoder reconstructs the speech signal to be as similar as possible to the original speech signal. Presently, a widespread speech encoding technique is based on Linear Prediction (LP), and more specifically on CELP technology. In LP-based coding, the speech signal is synthesized by filtering an excitation signal through an all-pole synthesis filter 1/A(z). In CELP, the excitation is typically composed of two parts: a first stage excitation signal selected from an adaptive codebook and a second stage excitation signal selected from a fixed codebook. Generally speaking, the adaptive codebook excitation models the periodic part of the excitation and the fixed codebook excitation is added to model the evolution of the speech signal.
[0061] The speech is normally processed by frames of typically 20 ms and the LP filter coefficients are transmitted once per frame. In CELP, every frame is further divided in several subframes to encode the excitation signal. The subframe length is typically 5 ms.
[0062] Referring to Figures 1a and 1b, the main principle behind CELP is called Analysis-by-Synthesis, where possible decoder outputs are tried (synthesis) already during the encoding process (analysis) and then compared to the original speech signal. The search minimizes a mean-squared error between the input speech signal s(n) and the synthesized speech signal s'(n) in a perceptually weighted domain, where the discrete time index n = 0, 1, ..., N-1, and N is the length of the subframe. The perceptual weighting filter W(z) exploits the frequency masking effect and is typically derived from the LP filter. An example of perceptual weighting filter W(z) is given in the following Equation (1):
W(z) = A(z/γ1) / A(z/γ2) (1)
where the factors γ1 and γ2 control the amount of perceptual weighting and satisfy the relation 0 < γ2 < γ1 ≤ 1. This traditional perceptual weighting filter works well for NB (narrowband - bandwidth of 200 - 3400 Hz) signals. An example of perceptual weighting filter for WB (wideband - bandwidth of 50 - 7000 Hz) signals can be found in Reference [1].
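As an illustration of Equation (1), the following C sketch applies W(z) = A(z/γ1)/A(z/γ2) to one subframe in direct form; the buffer convention (M_LP samples of history available before index 0, as is usual in codec implementations) and the LP order value are assumptions.

```c
#define M_LP 16 /* LP order, assumed */

/* Sketch: apply W(z) = A(z/gamma1)/A(z/gamma2) to one subframe.
 * a[0..M_LP] are the LP coefficients with a[0] = 1; 's' and 'sw'
 * point into buffers with at least M_LP samples of history before
 * index 0. */
static void weight_speech(const float *s, float *sw, int N,
                          const float a[M_LP + 1],
                          float gamma1, float gamma2)
{
    float num[M_LP + 1], den[M_LP + 1], g1 = 1.0f, g2 = 1.0f;

    for (int i = 0; i <= M_LP; i++) {
        num[i] = a[i] * g1;  g1 *= gamma1; /* coeffs of A(z/gamma1) */
        den[i] = a[i] * g2;  g2 *= gamma2; /* coeffs of A(z/gamma2) */
    }
    for (int n = 0; n < N; n++) {
        float x = 0.0f, y;
        for (int i = 0; i <= M_LP; i++)  /* FIR part: A(z/gamma1)  */
            x += num[i] * s[n - i];
        y = x;
        for (int i = 1; i <= M_LP; i++)  /* IIR part: 1/A(z/gamma2) */
            y -= den[i] * sw[n - i];
        sw[n] = y;
    }
}
```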
[0063] The bit stream transmitted to the decoder contains for the voiced frames the following encoding parameters: the quantized parameters of the LP synthesis filter, the adaptive and fixed codebook indices and the gains of the adaptive and fixed parts.
Adaptive Codebook Search
[0064] The adaptive codebook search in CELP-based codecs is performed in weighted speech domain to determine the delay (pitch period) t and the pitch gain gp, and to construct the quasi-periodic part of the excitation signal referred to as adaptive codevector v(n). The pitch period is strongly dependent on the particular speaker and its accurate determination critically influences the quality of the synthesized speech.
[0065] In the EV-VBR codec, a three-stage procedure is used to determine the pitch period and gain. In the first stage, three open-loop pitch estimates T_op are computed for each frame - one estimate for each 10 ms half-frame and one for a 10 ms look-ahead - using the perceptually weighted speech signal sw(n) and normalized correlation computation. In the second stage, a closed-loop pitch search is performed for integer periods around the estimated open-loop pitch periods T_op for every subframe. Once an optimum integer pitch period is found, a third search stage goes through the fractions around that optimum integer value. The closed-loop pitch search is performed by minimizing the mean-squared weighted error between the original and synthesized speech. This is achieved by maximizing the term
C = ( Σ_{n=0}^{N-1} x1(n) y1(n) ) / √( Σ_{n=0}^{N-1} y1(n) y1(n) ) (2)
where x1(n) is the target signal and the first stage contribution signal (also called filtered adaptive codevector) y1(n) is computed by the convolution of the past excitation signal v(n) at period t with the impulse response h(n) of the weighted synthesis filter H(z):
y1(n) = v(n) * h(n). (3)
[0066] The perceptually weighted input speech signal sw(n) is obtained by processing the input speech signal s(n) through the perceptual weighting filter W(z). The filter H(z) is formed by the cascade of the LP synthesis filter 1/A(z) and the perceptual weighting filter W(z). The target signal x1(n) corresponds to the perceptually weighted input speech signal sw(n) after subtracting therefrom the zero-input response of the filter H(z).
[0067] The pitch gain is found by minimizing the mean-squared error between the signal x1(n) and the first stage contribution signal y1(n). The pitch gain is expressed by the following Equation:
g_p = ( Σ_{n=0}^{N-1} x1(n) y1(n) ) / ( Σ_{n=0}^{N-1} y1(n) y1(n) ) (4)
[0068] The pitch gain is then bounded by 0 < g_p < 1.2 and typically jointly quantized with the fixed codebook gain once the innovation is found. [0069] In CELP-based codecs, the excitation signal in the beginning of the currently processed frame is thus reconstructed from the excitation signal of the previous frame. This mechanism is very efficient for voiced segments of the speech signal, where the signal is quasi-periodic, and in the absence of transmission errors. In case of frame erasure, the excitation signal from the previous frame is lost and the respective adaptive codebooks of the encoder and decoder are no longer the same. In frames following the erasure, the decoder then continues to synthesize the speech using an adaptive codebook with incorrect content. Consequently, a frame erasure degrades the synthesized speech quality not only during the erased frame, but can also degrade it during several subsequent frames. Traditional concealment techniques are often based on repeating the waveform of the previous correctly transmitted frame, but these techniques work efficiently only in the signal parts where the characteristics of the speech signal are quasi-stationary, for example in stable voiced segments. In this case, the difference between the respective adaptive codebooks of the encoder and decoder is often quite small and the quality of the synthesized signal is not much affected. However, if the erasure falls in a transition frame, the efficiency of these techniques is very limited. In communication systems using CELP-based codecs, where the Frame Erasure Rate (FER) is typically 3% to 5%, the synthesized speech quality then drops significantly.
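As a small illustration of Equation (4), a C sketch of the pitch gain computation with the bound applied follows; the same formula without the bound is reused later for the TM gain. Array names are assumptions.

```c
/* Sketch: pitch gain as in Equation (4), optionally bounded as in
 * the adaptive codebook search (the TM gain uses the same formula
 * without the bound). */
static float pitch_gain(const float *x1, const float *y1, int N,
                        int bounded)
{
    float num = 0.0f, den = 0.0f;
    for (int n = 0; n < N; n++) {
        num += x1[n] * y1[n];
        den += y1[n] * y1[n];
    }
    float g = (den > 0.0f) ? num / den : 0.0f;
    if (bounded) {
        if (g < 0.0f) g = 0.0f;
        if (g > 1.2f) g = 1.2f;
    }
    return g;
}
```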
[0070] Even in clean channel transmission, the efficiency of the adaptive codebook is limited in transition frames; the CELP encoder makes use of the adaptive codebook to exploit the periodicity in speech that is low or missing during transitions whereby the coding efficiency runs down. This is the case of voiced onsets in particular where the past excitation signal and the optimal excitation signal for the current frame are correlated very weakly or not at all.
Fixed Codebook Search
[0071] The objective of the Fixed (innovation) CodeBook (FCB) search in CELP-based codecs is to minimize the residual error after the use of the adaptive codebook, i.e.
E = Σ_{n=0}^{N-1} ( x2(n) - g_c y2(n) )² (5)
where g_c is the fixed codebook gain, and the second stage contribution signal (also called the filtered fixed codevector)
y2(n) = c_k(n) * h(n)
is the fixed codebook vector c_k(n) convolved with h(n). The target signal x1(n) is updated by subtracting the adaptive codebook contribution from the adaptive codebook target to obtain:
x2(n) = x1(n) - g_p y1(n). (6)
[0072] The fixed codebook can be realized for example by using an algebraic codebook as described in Reference [2]. If c_k denotes the algebraic code vector at index k, then the algebraic codebook is searched by maximizing the following criterion:
Q_k = ( dᵀ c_k )² / ( c_kᵀ Φ c_k ) (7)
where H is the lower triangular Toeplitz convolution matrix with diagonal h(0) and lower diagonals h(1), ..., h(N-1). Vector d = Hᵀx2 is the correlation between the updated target signal x2(n) and h(n) (also known as the backward filtered target vector), and matrix Φ = HᵀH is the matrix of correlations of h(n). The superscript ᵀ denotes matrix or vector transpose. Both d and Φ are usually computed prior to the fixed codebook search.
[0073] CELP is believed to be otherwise well known to those of ordinary skill in the art and, for that reason, will not be further described in the present specification.
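As an illustration of how criterion (7) is evaluated cheaply for sparse codevectors, the following C sketch computes it for a candidate with a few signed unit pulses, using the precomputed d and Φ. The storage layout (row-major Φ) and the names are assumptions, and the pulse-position iteration of the real search is omitted.

```c
/* Sketch: evaluate the criterion of Equation (7) for a sparse
 * algebraic codevector with 'np' unit pulses at positions pos[]
 * with signs sgn[] (+1/-1), using the precomputed backward filtered
 * target d = H^T x2 and correlation matrix phi = H^T H (row-major,
 * N x N).  Returns num^2/den for comparing candidate codevectors. */
static float acelp_criterion(const int *pos, const int *sgn, int np,
                             const float *d, const float *phi, int N)
{
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < np; i++)
        num += (float)sgn[i] * d[pos[i]];
    for (int i = 0; i < np; i++)
        for (int j = 0; j < np; j++)
            den += (float)(sgn[i] * sgn[j]) * phi[pos[i] * N + pos[j]];
    return (den > 0.0f) ? (num * num) / den : 0.0f;
}
```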
Frame Classification in the EV-VBR codec
[0074] The frame classification in the EV-VBR codec is based on the VMR-WB (Variable Rate Multi-Mode Wideband) classification as described in Reference [3]. VMR-WB classification is done with the consideration of the concealment and recovery strategy. In other words, any frame is classified in such a way that the concealment can be optimal if the following frame is missing, or that the recovery can be optimal if the previous frame was lost. Some of the classes used for frame erasure concealment processing need not be transmitted, as they can be deduced without ambiguity at the decoder. Five distinct classes are used, and defined as follows:
[0075] - UNVOICED class comprises all unvoiced speech frames and all frames without active speech. A voiced offset frame can be also classified as UNVOICED if its end tends to be unvoiced and the concealment designed for unvoiced frames can be used for the following frame in case it is lost.
[0076] - UNVOICED TRANSITION class comprises unvoiced frames with a possible voiced onset at the end. The voiced onset is however still too short or not built well enough to use the concealment designed for voiced frames. An UNVOICED TRANSITION frame can follow only a frame classified as UNVOICED or UNVOICED TRANSITION. [0077] - VOICED TRANSITION class comprises voiced frames with relatively weak voiced characteristics. Those are typically voiced frames with rapidly changing characteristics (transitions between vowels) or voiced offsets lasting the whole frame. A VOICED TRANSITION frame can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.
[0078] - VOICED class comprises voiced frames with stable characteristics. A VOICED frame can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.
[0079] - ONSET class comprises all voiced frames with stable characteristics following a frame classified as UNVOICED or UNVOICED TRANSITION. Frames classified as ONSET correspond to voiced onset frames where the onset is already sufficiently well built for the use of the concealment designed for lost voiced frames. The concealment techniques used for a frame erasure following a frame classified as ONSET are in traditional CELP-based codecs the same as following a frame classified as VOICED, the difference being in the recovery strategy when a special technique can be used to artificially reconstruct the lost onset. According to the non-restrictive illustrative embodiment of the present invention, the TM coding technique is successfully used in this case.
[0080] The classification state diagram is outlined in Figure 2. The classification information is transmitted using 2 bits. As can be seen from Figure 2, the UNVOICED TRANSITION class and VOICED TRANSITION class can be grouped together, as they can be unambiguously differentiated at the decoder (an UNVOICED TRANSITION frame can follow only UNVOICED or UNVOICED TRANSITION frames, and a VOICED TRANSITION frame can follow only ONSET, VOICED or VOICED TRANSITION frames).
[0081] The following parameters are used for the classification: a normalized correlation Rxy, a spectral tilt measure et, a pitch stability counter pc, a relative frame energy of the speech signal at the end of the current frame Erel, and a zero-crossing counter zc. As can be seen in the following detailed analysis, the computation of these parameters uses a lookahead. The lookahead allows the evolution of the speech signal in the following frame to be estimated and, consequently, the classification can be done by taking into account the future speech signal behaviour.
[0082] The average normalized correlation Rxy is computed as a mean of the maximum normalized correlation of the second half-frame and the lookahead using the following Equation:

Rxy = 0.5 (Cnorm(d1) + Cnorm(d2))    (8)
[0083] The maximum normalized correlations Cnorm are computed as a part of the open-loop pitch search and correspond to the maximized normalized correlations of two adjacent pitch periods of the weighted speech signal.
[0084] The spectral tilt parameter et contains the information about the frequency distribution of energy. The spectral tilt for one spectral analysis is estimated as a ratio between the energy concentrated in low frequencies and the energy concentrated in high frequencies. Here, the tilt measure used is the average in the logarithmic domain of the spectral tilt measures etilt(0) and etilt(1), each defined as a ratio of low to high frequency energies. That is:

et = 10 log(etilt(0) etilt(1))    (9)
[0085] The pitch stability counter pc assesses the variation of the pitch period. It is computed as follows:

pc = |Top1 − Top0| + |Top2 − Top0|    (10)
[0086] The values Top0, Top1 and Top2 correspond to the open-loop pitch estimates from the first half of the current frame, the second half of the current frame and the lookahead, respectively.
[0087] The relative frame energy Erel is computed as a difference in dB between the current frame energy and the long-term active-speech energy average.
[0088] The last parameter is the zero-crossing parameter zc computed on a 20 ms segment of the speech signal. The segment starts in the middle of the current frame and uses two subframes of the lookahead. Here, the zero-crossing counter zc counts the number of times the speech signal sign changes from positive to negative during that interval.
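By way of non-limitative illustration, three of the classification parameters can be sketched as follows (Python with numpy); the windowing details are simplified, and the reading of Equation (10) follows the reconstruction above, so both are assumptions of this example.

import numpy as np

def zero_crossing_counter(segment):
    # zc: positive-to-negative sign changes of the speech signal over the
    # 20 ms segment described in paragraph [0088]
    return int(np.sum((segment[:-1] > 0) & (segment[1:] <= 0)))

def pitch_stability_counter(Top0, Top1, Top2):
    # pc per Equation (10), from the open-loop pitch estimates of the two
    # half-frames and the lookahead
    return abs(Top1 - Top0) + abs(Top2 - Top0)

def relative_frame_energy(frame, long_term_db):
    # Erel: current frame energy in dB minus the long-term active-speech
    # energy average (also in dB)
    e_db = 10.0 * np.log10(np.sum(frame.astype(float) ** 2) + 1e-12)
    return e_db - long_term_db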
[0089] To make the classification more robust, the classification parameters are considered together, forming a function of merit fm. For that purpose, the classification parameters are first scaled between 0 and 1 so that each parameter's value typical of an unvoiced speech signal translates into 0 and each parameter's value typical of a voiced speech signal translates into 1. A linear function is used between them. The scaled version ps of a certain parameter px is obtained using the Equation:
ps = kp px + cp, constrained by 0 ≤ ps ≤ 1    (11)
[0090] The function coefficients kp and cp have been found experimentally for each of the parameters so that the signal distortion due to the concealment and recovery techniques used in the presence of frame errors is minimal. The values used are summarized in Table 1.
Table 1 - Signal Classification Parameters and the coefficients of their respective scaling functions.
(table reproduced as an image in the original publication)
[0091] Then the function of merit fm has been defined as:
fm = (1/6) (2Rxy^s + et^s + pc^s + Erel^s + zc^s) ,    (12)
where the superscript s indicates the scaled version of the parameters.
[0092] A first classification decision is made for the UNVOICED class as follows:
If (local_VAD = 0) OR (Erel < −8) then class = UNVOICED    (13)
where local_VAD stands for local Voice Activity Detection. [0093] If the above condition (13) is not satisfied, then the classification proceeds using the function of merit fm and following the rules summarized in Table 2.
Table 2 - Signal Classification Rules at the Encoder.
(table reproduced as an image in the original publication)
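Since Tables 1 and 2 are reproduced only as images in the original publication, the following sketch uses placeholder coefficients and a placeholder decision rule; it merely illustrates the flow of Equations (11) to (13) and is not the patented decision logic.

def scale_parameter(p, kp, cp):
    # Equation (11): linear scaling clipped to [0, 1]
    return min(max(kp * p + cp, 0.0), 1.0)

def classify_frame(params, coeffs, local_vad, rules):
    # params/coeffs keys: "Rxy", "et", "pc", "Erel", "zc"
    if local_vad == 0 or params["Erel"] < -8:                    # Equation (13)
        return "UNVOICED"
    s = {k: scale_parameter(params[k], *coeffs[k]) for k in params}
    fm = (2.0 * s["Rxy"] + s["et"] + s["pc"]
          + s["Erel"] + s["zc"]) / 6.0                           # Equation (12)
    return rules(fm)   # Table 2 decision, which also depends on the previous class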
[0094] The class information is encoded with two bits as explained herein above. Despite the fact that the supplementary information, which improves frame erasure concealment, is transmitted only in Generic frames, the classification is performed for each frame. This is needed to keep the classification state machine up to date, as it uses the information about the class of the previous frame. The classification is however straightforward for encoding types dedicated to UNVOICED or VOICED frames: voiced frames are always classified as VOICED and unvoiced frames are always classified as UNVOICED.

Frame Selection for TM coding
[0095] As discussed previously, the technique being described replaces the adaptive codebook of CELP-based coders by a glottal-shape codebook to improve the robustness to frame erasures and to enhance the coding efficiency when non-stationary speech frames are processed. This means that this technique does not construct the first stage excitation signal from the past excitation, but selects the first stage excitation signal from the glottal-shape codebook. The second stage excitation signal (the innovation part of the total excitation) is still selected from the traditional CELP fixed codebook. Neither of these codebooks uses information from past (previously transmitted) speech frames, thereby eliminating the main reason for the frame error propagation inherent to CELP-based encoders.
[0096] Using the TM coding technique systematically (to encode all frames) would greatly limit the error propagation, but the coding efficiency and the synthesized speech quality would drop in error-free conditions. As a compromise between the clean-channel performance of the codec and its robustness to channel errors, the TM coding technique can be applied only to the transition frames and to several frames following each transition frame. For frame erasure robustness, the TM coding technique can be used for voiced speech frames following transitions. As introduced previously, these transitions comprise basically the voiced onsets and the transitions between two different voiced sounds. To select pertinent frames to be encoded using the TM coding technique, transitions are detected. While any detector of transitions can be used, the non-restrictive illustrative embodiment uses the classification of the EV-VBR framework as described herein above.
[0097] The TM coding technique can be applied to encode transition (voiced onset or transition between two different voiced sounds) frames as described above and several subsequent frames. The number of TM frames (frames encoded using the TM coding technique) is a matter of compromise between the codec performance in clean-channel conditions and in conditions with channel errors. If only the transition (voiced onset or transition between two different voiced sounds) frames are encoded using the TM coding technique, the encoding efficiency increases. This increase can be measured by the increase of the segmental signal-to-noise ratio (SNR), for example. The SNR is computed using the following Equation:

SNR = Esd / Ee    (14)

where Esd is the energy of the input speech signal of the current frame and Ee is the energy of the error between this input speech signal and the synthesis speech signal of the current frame.
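A minimal sketch of this measure follows (Python with numpy), assuming per-frame arrays of input and synthesized speech; Equation (14) is a plain energy ratio, and the conversion to dB used for reporting is an assumption of this example.

import numpy as np

def frame_snr_db(speech, synthesis):
    # Equation (14): ratio of input-speech energy Esd to error energy Ee,
    # expressed here in dB
    e_sd = np.sum(speech.astype(float) ** 2)
    e_e = np.sum((speech.astype(float) - synthesis.astype(float)) ** 2)
    return 10.0 * np.log10(e_sd / (e_e + 1e-12))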
[0098] However, using the TM coding technique to encode only the transition frames does not help much with error robustness; if the transition (voiced onset or transition between two different voiced sounds) frame is lost, the error will propagate, as the following frames would be coded using the standard CELP procedure. On the other hand, if the frame preceding the transition (voiced onset or transition between two different voiced sounds) frame is lost, the effect of this lost preceding frame on the performance is not critical even without the use of the TM coding technique. In the case of voiced onset transitions, the frame preceding the onset is likely to be unvoiced and the adaptive codebook contribution is not very important. In the case of a transition between two voiced sounds, the frame before the transition is generally fairly stationary and the adaptive codebook states in the encoder and the decoder are often similar after the frame erasure.
[0099] To increase the robustness, frames following the transition (voiced onset or transition between two different voiced sounds) can be encoded using the TM coding technique. If the clean-channel performance enhancement is not important, the TM coding technique can be used only in the frames following the transition frames. Basically, the number of consecutive TM frames depends on the number of consecutive frame erasures one wants to consider for protection. If only isolated erasures are considered (i.e. one isolated frame erasure at a time), it is sufficient to encode only the frame following the transition (voiced onset or transition between two different voiced sounds) frame. If the transition (voiced onset or transition between two different voiced sounds) frame is lost, the following frame is encoded without the use of the past excitation signal and the error propagation is broken. It should be pointed out, however, that if the transition (voiced onset or transition between two different voiced sounds) frame is transmitted correctly but the following frame is lost, the error propagation is not prevented, as the next frame already uses classical CELP encoding. However, the distortion will likely be limited if at least one pitch period is already well built at the end of the transition (voiced onset or transition between two different voiced sounds), as shown in Figure 3.
[00100] When the TM coding technique is implemented in an existing codec and the class of the current frame and the coding mode are known, the following scheme to set the onset and the following frames for TM coding can be used. A parameter state, which is a counter of the consecutive TM frames previously used, is stored in the encoder state memory. If the value of this parameter state is negative, TM coding cannot be used. If the parameter state is non-negative but lower than or equal to the number of consecutive frame erasures to protect, and the class of the frame is ONSET, VOICED or VOICED TRANSITION, the frame is denoted as a TM frame (see Figure 4 for more detail). In other words, the frame is denoted as a TM frame if N_TM_FRAMES > state ≥ 0, where N_TM_FRAMES is the number of consecutive frames to be protected using the TM coding technique.
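The following sketch mirrors this decision; the way the state counter is reset when the chain of eligible classes is broken is not spelled out in the text, so those updates are assumptions of this example.

N_TM_FRAMES = 2  # number of consecutive frame erasures to protect against

def tm_frame_decision(state, frame_class):
    # Returns (use_tm, new_state); a negative state disables TM coding.
    eligible = frame_class in ("ONSET", "VOICED", "VOICED TRANSITION")
    if 0 <= state < N_TM_FRAMES and eligible:
        return True, state + 1   # this frame is denoted as a TM frame
    if frame_class in ("UNVOICED", "UNVOICED TRANSITION"):
        return False, 0          # a following onset may start a new TM run
    return False, -1             # assumption: stable voiced stretch, no TM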
[00101] If it is expected that the communication channel characteristics are such that more than one isolated frame is often erased at a time, i.e. that the frame erasures tend to appear in bundles, the best solution might be to use the TM coding technique to protect against two or even more consecutive frame erasures. However, the coding efficiency in clean-channel conditions will drop. If feedback about the channel is available in the encoder, the number of consecutive TM frames might be made adaptive to the conditions of transmission. In the non-restrictive illustrative embodiment of the present invention, up to two TM frames following the transition (voiced onset or transition between two different voiced sounds) frame are considered, which corresponds to a design able to cope with up to two consecutive frame erasures.
[00102] The above described decision uses basically a fixed number (whether this number is fixed before the transmission or is dependent on channel conditions of transmission) of TM frames following the transition (voiced onset or transition between two different voiced sounds) frame. The compromise between the clean-channel performance and the frame-error robustness can also be based on a closed-loop classification. More specifically, in the frame that one wants to protect against the previous frame erasure, or for which one wants to decide if it is the onset frame, a computation of the two possible coding modes is done in parallel; the frame is processed both using the generic (CELP) coding mode and the TM coding technique. The performance of both approaches is then compared using an SNR measure, for example; for more details see the following Section entitled "TM Technique Performance in EV-VBR Codec". When the difference between the SNR for the generic (CELP) coding mode and the SNR for the TM coding technique is greater than a given threshold, the generic (CELP) coding mode is applied. If the difference is smaller than the given threshold, the TM coding technique is applied. The value of the threshold is chosen depending on how strong a frame erasure protection and onset coding determination is required.
Subframe Selection for Glottal-Shape Codebook Search
[00103] In the previous Section, the reasons and mechanisms for selecting frames for coding using the TM coding technique were described. Now it will be shown that it is generally more efficient not to use the glottal-shape codebook in all subframes, in order to achieve the best compromise between the clean-channel performance at a given bit rate and the performance in the presence of an erasure in the frames preceding the TM frames. First, the glottal-shape codebook search is important only in the first pitch period of a frame. The following pitch periods can be encoded using the more efficient standard adaptive codebook search, since they no longer use the excitation of the past frame (when the adaptive codebook is searched, the excitation is searched up to about one pitch period in the past). There is consequently no reason to employ the glottal-shape codebook search in subframes containing no portion of the first pitch period of a frame.
[00104] Similarly, when the glottal-shape codebook search is used to increase the encoding efficiency in voiced onset frames, this glottal-shape codebook search is used on the first pitch period of the starting voiced segment. The reason is that, for the first pitch period, the adaptive codebook contains a noise-like signal (the previous segment was not voiced) and replacing it with a quantized glottal impulse often increases the coding efficiency. For the following pitch periods, however, the periodic excitation has already built up in the adaptive codebook and using this codebook will yield better results. For this reason, the information on the voiced onset position is available at least with subframe resolution.
[00105] Further optimization of the bit allocation concerns frames with pitch periods longer than the subframe length. Given that the glottal-shape codebook contains quantized shapes of the glottal impulse, the codebook is best suited for subframes containing the glottal impulse. In other subframes, its efficiency is low. Given that the bit rate is often quite limited in speech encoding applications and that the encoding of the glottal-shape codebook requires a relatively large number of bits at low bit rates, a bit allocation where the glottal-shape codebook is used and searched only in one subframe per frame was chosen in the non-restrictive, illustrative embodiment.

[00106] To choose the subframe to be encoded with the glottal-shape codebook, the first glottal impulse in the LP residual signal is looked for. The following simple procedure can be used. The maximum sample of the LP residual signal is searched in the range [0, 0 + Top + 2], where Top is the open-loop pitch period for the first half-frame and 0 corresponds to the frame beginning. In the case of voiced onset frames, and if the beginning of the onset can be reliably determined, 0 denotes the beginning of the subframe where the onset beginning is located. The glottal-shape codebook is then employed in the subframe with the maximum residual signal energy. Moreover, the position of the maximum indicates where the glottal impulse can approximately be situated, and this can be exploited for complexity reduction, as will be discussed later. Note that, as the glottal-shape codebook search replaces only the adaptive codebook search, a fixed codebook search is done in every subframe of a TM frame.
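The subframe selection can be sketched as follows (Python with numpy); treating the subframe holding the maximum absolute residual sample as the maximum-energy subframe is a simplification made for this example.

import numpy as np

def select_tm_subframe(residual, Top, N=64, start=0):
    # Search the maximum of the LP residual in [start, start + Top + 2];
    # `start` is 0 or the beginning of the subframe containing the onset.
    window = np.abs(residual[start:start + Top + 3])
    pos = start + int(np.argmax(window))   # approximate first glottal impulse
    return pos // N, pos                   # (TM subframe index, impulse position)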
[00107] The other subframes (not encoded with the use of the glottal-shape codebook) are processed as follows. If the subframe using the glottal-shape codebook search is not the first subframe in the frame, the excitation signal in the preceding subframe(s) of the frame is encoded using the fixed CELP codebook only; this means that the first stage excitation signal is zero. If the glottal-shape codebook subframe is not the last subframe in the frame, the following subframe(s) of the frame is/are processed using standard CELP encoding (i.e. using the adaptive and the fixed codebook search). In Figures 5a-5c, the situation is shown for the case where the first glottal impulse emerges in the 2nd subframe. In Figure 5b, u(n) is the LP residual signal. The first stage excitation signal is denoted qk'(n) when it is built using the glottal-shape codebook, or v(n) when it is built using the adaptive codebook. In this example (Figure 5c), the first stage excitation signal is zero in the 1st subframe, a glottal-shape codevector in the 2nd subframe and an adaptive codebook vector in the last two subframes.
[00108] In order to further increase coding efficiency and to optimize bit allocation, different processing is used in particular subframes of a TM frame depending on the pitch period. When the first subframe is chosen as the TM subframe, the subframe containing the 2nd glottal impulse in the LP residual signal is determined. This determination is based on the pitch period value, and the following four situations can then occur. In the first situation, the 2nd glottal impulse is in the 1st subframe, and the 2nd, 3rd and 4th subframes are processed using standard CELP encoding (adaptive and fixed codebook search). In the second situation, the 2nd glottal impulse is in the 2nd subframe, and the 2nd, 3rd and 4th subframes are again processed using standard CELP encoding. In the third situation, the 2nd glottal impulse is in the 3rd subframe. The 2nd subframe is processed using the fixed codebook search only, as there is no glottal impulse in the 2nd subframe of the LP residual signal to be searched for using the adaptive codebook. The 3rd and 4th subframes are processed using standard CELP encoding. In the last (fourth) situation, the 2nd glottal impulse is in the 4th subframe (or in the next frame); the 2nd and 3rd subframes are processed using the fixed codebook search only, and the 4th subframe is processed using standard CELP encoding. A more detailed discussion is provided in an exemplary implementation later below.
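A sketch of the resulting configuration choice, using the TRANSITION_m(_n) naming of Table 3 (m: subframe of the 1st glottal impulse; n: subframe of the 2nd impulse when the 1st lies in the 1st subframe); the index arithmetic is an assumption of this example.

def tm_configuration(first_pos, T0, N=64):
    first_sf = first_pos // N + 1           # subframe of the 1st impulse (1..4)
    if first_sf > 1:
        return "TRANSITION_%d" % first_sf
    second_sf = (first_pos + T0) // N + 1   # subframe of the 2nd impulse
    return "TRANSITION_1_%d" % min(second_sf, 4)  # impulse in next frame -> _4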
[00109] Table 3 shows the names of the possible coding configurations and their occurrence statistics. In other words, Table 3 gives the distribution of the first and the second glottal impulse occurrence in each subframe for frames processed with the TM coding technique. Table 3 corresponds to the scenario where the TM coding technique is used to encode only the voiced onset frame and one subsequent frame. The frame length of the speech signal in this experiment was 20 ms, the subframe length 5 ms, and the experiment was conducted using voices of 32 men and 32 women (unless mentioned otherwise, the same speech database was also used in all other experiments mentioned in the following description).
Table 3 - Coding mode configurations for TM and their occurrence when a speech signal is processed.

(table reproduced as an image in the original publication)
Glottal-Shape Codebook
[0100] In principle, the glottal-shape codebook consists of quantized normalized shapes of the glottal impulses placed at a specific position. Consequently, the codebook search consists both in the selection of the best shape and in the determination of its best position in a particular subframe. In its simplest form, the shape of the glottal impulse can be represented by a unit impulse and does not need to be quantized. In that case, only its position in the subframe is determined. However, the performance of such a simple codebook is very limited.
[0101] On the other hand, the best representation would probably be achieved if the length L of the glottal-shape codebook entries corresponded to the length of the pitch period, and if a large number of glottal impulse shapes were represented. As the length and the shape of the glottal impulses vary from speaker to speaker and from frame to frame, the complexity and memory requirements to search and store such a codebook would be too extensive. As a compromise, the length of the glottal impulses as well as their number must be limited. In the non-restrictive illustrative embodiment, the glottal-shape codebook is composed of eight (8) different glottal impulse shapes and the length of each glottal impulse is L = 17 samples. The quantized shapes have been selected such that the absolute maximum is around the middle of this length. During the glottal-shape codebook search, this middle is aligned with the index k', which represents the position of the glottal impulse in the current subframe and is chosen from the interval [0, N − 1], N being the subframe length. As the codebook entry length of 17 samples is shorter than the subframe length, the remaining samples are set to zero.
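A minimal sketch of the codevector construction follows (Python with numpy), assuming a 17-sample shape table; it is an illustration, not the patented implementation.

import numpy as np

def glottal_shape_codevector(shape, k, N=64):
    # Centre the L-sample glottal shape on position k of the subframe;
    # samples falling outside [0, N-1] are truncated, the rest stays zero.
    L12 = (len(shape) - 1) // 2   # L1/2 = 8 for L = 17
    q = np.zeros(N)
    for i, g in enumerate(shape):
        n = k - L12 + i
        if 0 <= n < N:
            q[n] = g
    return q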
[0102] The glottal-shape codebook is designed to represent as many existing glottal impulses as possible. A training process based on the k-means algorithm [4] was used; the glottal-shape codebook was trained using more than three (3) hours of speech signal composed of utterances of many different speakers speaking in several different languages. From this database, the glottal impulses have been extracted from the LP residual signal and truncated to 17 samples around the maximum absolute value. From the sixteen (16) shapes selected by the k-means algorithm, the number of shapes has been further reduced experimentally to eight (8) shapes using a segmental SNR quality measure. The selected glottal-shape codebook is shown in Figure 6. Obviously, other means can be used to design the glottal-shape codebook.

Glottal-Shape Codebook Search
[0103] The actual realization of the glottal-shape codebook can be done in several ways. For example, the search can be performed similarly to the fixed codebook search in CELP. In this case the codebook is constructed by placing the center of the glottal impulse shapes at all possible positions in the subframe. For instance, for a subframe length of sixty-four (64) samples and eight (8) glottal impulse shapes, a glottal-shape codebook of size 64×8 = 512 codevectors is obtained. In accordance with another example, similarly to the adaptive codebook search, the codebook entries can be successively placed at all potential positions in the past excitation and the best shape/position combination can be selected in a similar way as in the adaptive codebook search. In the latter realization, all pitch cycle repetition is automatically done through the long-term CELP filter and the glottal impulses are represented with full-sized shapes (in contrast to the first realization, where glottal-shape truncation is necessary in border cases, as will be discussed later).
[0104] The non-restrictive illustrative embodiment uses the configuration where the codebook search is similar to the fixed codebook search in Algebraic CELP (ACELP). In this approach, each of the candidate shapes is represented as an impulse response of a shaping filter G(z). Thus the codevectors corresponding to glottal impulse shapes centered at different positions can be represented by codevectors containing only one non-zero element filtered through the shaping filter G(z) (for a subframe size N there are N single-pulse vectors for potential glottal impulse positions k').
[0105] Because the glottal impulse position k' is in the middle of the glottal shape with an odd length of L samples, and k' is in the range [0, N − 1], the glottal shape must be truncated for the first and for the last L1/2 = (L − 1)/2 samples. This is taken into consideration during the glottal pulse search, since it makes the shaping filter G(z) a non-causal filter.

[0106] The configuration of the TM part is shown in Figure 7 for the encoder and in Figure 11 for the decoder. As already mentioned, the TM part replaces the adaptive codebook part of the encoder/decoder. During the search, the impulse response of the shaping filter G(z) can be integrated into the impulse response of the filter H(z).
[0107] A procedure and corresponding codebook searcher for finding the optimum glottal impulse center position k' for a certain shape of the glottal impulse rendered by the shaping filter G(z) will now be described. Because the shape of the filter G(z) is chosen from several candidate shapes (eight (8) shapes are used in the non-restrictive illustrative embodiment, as illustrated in Figure 6), the search procedure must be repeated for each glottal shape of the codebook in order to find the optimum impulse shape and position.
[0108] To determine the TM coding parameters, the search minimizes the mean-squared error between the target vector x1 and the glottal-shape codevector centered at position k' and filtered through the weighted synthesis filter H(z). Similarly to CELP, the search can be performed by finding the maximum of a criterion of the form:

Qk' = (x1^T yk')² / (yk'^T yk')    (15)

where yk' is the filtered glottal-shape codevector. Let qk' denote the glottal-shape codevector centered at position k' and pk' a position codevector with one (1) non-zero element indicating the position k'; then qk' can be written as qk' = G pk', where G is a Toeplitz matrix representing the shape of the glottal impulse. Therefore, similarly to the fixed codebook search, the following Equation can be written:

Qk' = (x1^T yk')² / (yk'^T yk') = (x1^T H qk')² / (qk'^T H^T H qk') = (x1^T H G pk')² / (pk'^T G^T H^T H G pk')
    = (dg^T pk')² / (pk'^T Z^T Z pk') = (dg^T pk')² / (pk'^T Φg pk')    (16)

where H is the lower triangular Toeplitz convolution matrix of the weighted synthesis filter, Z = HG, dg = Z^T x1 and Φg = Z^T Z. As will be discussed later, the rows of the matrix Z^T correspond to the filtered shifted versions of the glottal impulse shape or its truncated representations. Note that all vectors in this text are assumed to be column vectors (N × 1 matrices).
[0109] An example of the matrix G in transposed form (T), for an impulse length of three (3) samples and N = 4, would be:

G^T = | g(0)  g(1)  0     0    |
      | g(−1) g(0)  g(1)  0    |
      | 0     g(−1) g(0)  g(1) |
      | 0     0     g(−1) g(0) |    (17)

where g(n) are the coefficients of the impulse response of the non-causal shaping filter G(z). In the following description, the coefficients of the non-causal shaping filter G(z) are given by the values g(n), for n within the range [−L1/2, L1/2]. Because the position codevector pk' has only one non-zero element, the computation of the criterion (16) is very simple and can be expressed using the following Equation:
Qk' = dg²(k') / Φg(k', k')    (18)

[0110] As can be seen from Equation (18), only the diagonal of the matrix Φg needs to be computed.
[0111] A graphical representation of the computation of the criterion (18) for one glottal-shape codevector is shown in Figures 8 and 9. As already mentioned, Equation (18) is typically used in the ACELP algebraic codebook search by precomputing the backward filtered target vector dg and the correlation matrix Φg. However, given the non-causal nature of the shaping filter G(z), this cannot be directly applied for the first L1/2 positions. In these situations a more sophisticated search is used, where some computed values can still be reused to maintain the complexity at a low level. This will be described hereinafter.
[0112] Let zk' denote the (k'+1)th row of the matrix Z^T, where the matrix Z^T (Figure 10) is computed as follows. Given the non-causal nature of the shaping filter G(z), the matrix Z^T is computed in two stages to minimize the computational complexity. The first L1/2 + 1 rows of this matrix are computed first. For the remaining part of the matrix Z^T (the last N − L1/2 − 1 rows), the criterion (18) is used in a manner similar to the ACELP fixed codebook search.
[0113] The computation of the matrix Z^T and of the criterion (18) will now be described in detail.
[0114] In the first stage, the first L1/2 + 1 rows of the matrix Z^T, which correspond to the positions k' within the range [0, L1/2], are computed. For these positions, a different truncated glottal shape is used for each position k' within this range. In a first operation, the convolution between the glottal-shape response for position k' = 0 and the impulse response h(n) is computed using the Equation:

z0(n) = Σi=0..n g(n − i) h(i)    (19)

where advantage is taken of the fact that, for this position, the shaping filter G(z) has only L1/2 + 1 non-zero coefficients, i.e. g(0), g(1), ..., g(L1/2).
[0115] In a second operation, the convolution z1(n) between the glottal-shape codebook response for position k' = 1 and the impulse response h(n) is computed by reusing the values of z0(n) as follows (the matrix Z^T = G^T H^T is a matrix with some zero negative-sloping diagonals, but this matrix Z^T is no longer a Toeplitz and triangular matrix, as shown in Figure 10):

z1(0) = g(−1) h(0)
z1(n) = z0(n − 1) + g(−1) h(n),  for n = 1, ..., N − 1    (20)
[0116] For the following rows, the same recursion is reused:

zk'(0) = g(−k') h(0)
zk'(n) = zk'−1(n − 1) + g(−k') h(n),  for n = 1, ..., N − 1    (21)
[0117] The recursion (21) is repeated for all k' ≤ L1/2. For k' = L1/2 the shaping filter G(z) already has L non-zero coefficients, and the (L1/2 + 1)th row of the matrix Z^T is thus obtained by:

zL1/2(0) = g(−L1/2) h(0)
zL1/2(n) = zL1/2−1(n − 1) + g(−L1/2) h(n),  for n = 1, ..., N − 1    (22)

[0118] At this point, the first L1/2 + 1 rows of the matrix Z^T have been computed. These rows comprise no zero coefficients (Figure 10). The criterion (18) can then be computed for k' within the range [0, L1/2] using the Equation:
Qk' = dg²(k') / Φg(k', k') = (Σn=0..N−1 x1(n) zk'(n))² / Σn=0..N−1 zk'(n) zk'(n)    (23)
[0119] In the second stage, the rest of the matrix Z^T is computed and the criterion (18) is evaluated for positions k' within the range [L1/2 + 1, N − 1]. Advantage is taken of the fact that rows L1/2 + 1, ..., N − 1 of the matrix Z^T are built using coefficients of the convolution zL1/2(n) that have already been computed as described by Equation (22). The difference is that only a part of the coefficients is needed to compute these rows. That is, each row corresponds to the previous row shifted to the right by one position, with a zero added at the beginning:

zk'(0) = 0
zk'(n) = zk'−1(n − 1),  for n = 1, ..., N − 1    (24)

[0120] This is repeated for k' within the range [L1/2 + 1, N − 1].
[0121] In this second stage, the criterion (18) can be computed in a manner similar to that described in the above section Fixed Codebook Search to further reduce the computational complexity. The criterion (18) is first evaluated for the last position k' = N − 1 (this is the last row of the matrix Z^T). For k' = N − 1 the numerator and the denominator of the criterion (18) are given by the following Equations:

dg(N − 1) = Σi=0..L1/2 x1(N − 1 − L1/2 + i) zL1/2(i)    (25)

and

Φg(N − 1, N − 1) = Σi=0..L1/2 zL1/2(i) zL1/2(i)    (26)
[0122] Since some of the coefficients of the matrix Z^T are zeros (Figure 10), only L1/2 + 1 multiplications (instead of the N multiplications used in Equation (23)) are needed to compute the numerator and the denominator of the criterion (18).
[0123] Using the example of Figure 10 (L1/2 = 1), the criterion (18), computed using Equations (25) and (26), can be written as follows:

QN−1 = dg²(N − 1) / Φg(N − 1, N − 1) = (x1(N − 2) zL1/2(0) + x1(N − 1) zL1/2(1))² / (zL1/2(0) zL1/2(0) + zL1/2(1) zL1/2(1))    (27)
[0124] In the next steps, some previously computed values can again be reused for the denominator computation. For the position N − 2, the denominator of the criterion (18) is computed using:

Φg(N − 2, N − 2) = Φg(N − 1, N − 1) + zL1/2(L1/2 + 1) zL1/2(L1/2 + 1)    (28)
[0125] The numerator is computed using Equation (25) with the summation range changed:

dg(N − 2) = Σi=0..L1/2+1 x1(N − 2 − L1/2 + i) zL1/2(i)    (29)
[0126] In a similar manner, the numerator and the denominator of criterion (18) are calculated for all positions k' > L1/2.
[0127] The above described procedure allows finding the maximum of the criterion (18) for the codevectors that represent the first shape of the glottal-shape codebook. The search then continues, using the same procedure, for all other glottal impulse shapes. The overall maximum of the criterion (18) determines the one glottal shape and one position k' constituting the result of the search.
[0128] It is also possible to use sub-sample resolution when searching the glottal impulse center position k'; this will, however, result in increased complexity. More specifically, this requires up-sampling the glottal impulse shapes to increase the resolution and extracting different shifted versions at different resolutions. This is equivalent to using a larger glottal-shape codebook.
[0129] Ideally, the criterion (18) is computed for all possible glottal impulse positions k'. In the non-restrictive illustrative embodiment, the search is performed only in a restrained range around the expected position k' to further reduce the computational complexity. This expected position is in the range [kmin, kmax], 0 < kmin < kmax < N, and can be determined for the first glottal shape from the LP residual signal maximum, found as described in the above Section Subframe Selection for Glottal-Shape Codebook Search. A glottal-shape codebook search is then performed and the position k' is found for the first glottal shape. The new range [kmin, kmax] is set for the second glottal shape search as follows:

kmin = k' − Δ
kmax = k' + Δ    (30)
[0130] Typically Δ = 4. Similarly, Equation (30) is used to define the search range for the third shape around the selected position of the second shape and so on.
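Putting the pieces together, a hedged sketch of the shape/position search follows (it reuses compute_Z_rows from the earlier sketch and evaluates the criterion (18) directly, rather than through the recursions of Equations (25) to (33)):

import numpy as np

def search_glottal_codebook(shapes, h, x1, k_expected, N=64, delta=4):
    best_q, best_shape, best_k = -1.0, None, None
    kmin, kmax = max(0, k_expected - delta), min(N - 1, k_expected + delta)
    for s, g in enumerate(shapes):
        Z = compute_Z_rows(g, h, N)
        for k in range(kmin, kmax + 1):
            num = float(np.dot(x1, Z[k])) ** 2        # dg(k')^2
            den = float(np.dot(Z[k], Z[k])) + 1e-12   # Phig(k', k')
            if num / den > best_q:
                best_q, best_shape, best_k = num / den, s, k
        # Equation (30): re-centre the range on the best position so far
        kmin, kmax = max(0, best_k - delta), min(N - 1, best_k + delta)
    return best_shape, best_k, best_q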
[0131] In the following example, it is supposed that the initial search range is [N − 15, N − 7], L = 17 and N = 64. The search starts with the computation of the values zL1/2(n). Then the criterion (18) for the position k' = N − 7 is evaluated using:

QN−7 = (Σi=0..L1/2+6 x1(N − 7 − L1/2 + i) zL1/2(i))² / Σi=0..L1/2+6 zL1/2(i) zL1/2(i)    (31)
[0132] To compute the criterion for position k' = N − 8, the denominator is updated recursively as:

Φg(N − 8, N − 8) = Φg(N − 7, N − 7) + zL1/2(L1/2 + 7) zL1/2(L1/2 + 7)    (32)
[0133] In the same manner, the denominator is computed for all remaining positions down to k' = N − 15. The numerator of the criterion (18) is computed for every position within the range [N − 15, N − 7] separately, in a manner similar to Equation (29), using:

dg(k') = Σi=0..N−k'+L1/2−1 x1(k' − L1/2 + i) zL1/2(i)    (33)
[0134] The last parameter to be determined in the glottal-shape codebook search is the gain gp, which can be computed as in Equation (4), with the difference that it is not bounded as in the adaptive codebook search. The reason is that the filtered glottal-shape codevector is constructed using normalized quantized glottal shapes, whose energy can be very different from the energy of the actual excitation signal impulses.
[0135] The indices related to the glottal impulse position and the glottal shape are transmitted to the decoder. The filtered glottal-shape codevector reconstruction in the decoder is shown in Figure 11. It should be noted that the pitch period length no longer needs to be transmitted in a glottal-shape codebook search subframe, except when the subframe contains more than one glottal impulse, as will be discussed hereinafter.
More Glottal Impulses in One Subframe
[0136] There are situations where the pitch period of the speech signal is shorter than the subframe length, in which case the subframe can contain more than one glottal impulse (especially in the configuration TRANSITION_1_1). In this case it is necessary to model all the glottal impulses. Given the pitch period length limitations and the subframe length, a subframe cannot contain more than two glottal impulses in this non-restrictive illustrative embodiment.
[0137] These situations can be handled by two different approaches. The first and simpler one uses a procedure similar to the periodicity enhancement (pitch sharpening) used in AMR-WB (Adaptive Multi-Rate Wideband) as described in Reference [1], where the impulse is basically repeated with the pitch period using a linear filter. As illustrated in Figure 12a, the glottal-shape codevector qk'(n) is thus processed through an adaptive repetition filter of the form:

Q(z) = 1 / (1 − α z^(−T0))    (34)
[0138] The pitch period T0 can be determined, for example, by the standard closed-loop pitch search approach. The parameter α impacts the energy of the second impulse and, in the non-restrictive illustrative embodiment, has been set to α = 0.85. This technique adds the missing glottal impulse at the correct position into the glottal-shape codevector; this is illustrated as the dotted impulse in Figure 12b. This situation appears when the sum of the glottal impulse central position k' and the pitch period T0 is less than the subframe length N, i.e. when (k' + T0) < N. But also in situations where the sum of the impulse position k' and the pitch period exceeds the subframe length, the pitch period value is used to build the fixed codevector when pitch sharpening in the algebraic codebook is used.
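A direct sketch of the repetition filter of Equation (34) applied to a glottal-shape codevector (Python with numpy):

import numpy as np

def apply_repetition_filter(q, T0, alpha=0.85):
    # y(n) = q(n) + alpha * y(n - T0): repeats the glottal impulse one
    # pitch period later with attenuated energy (alpha = 0.85 here)
    y = q.astype(float).copy()
    for n in range(T0, len(y)):
        y[n] += alpha * y[n - T0]
    return y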
[0139] The repetition filter Q(z) is inserted into the TM part of the codec between the filters G(z) and H(z), as shown in the block diagram of Figure 13 for the encoder. The same change is made in the decoder. Similarly to pitch sharpening, the impulse response of the repetition filter Q(z) can be incorporated into the impulse responses of G(z) and H(z) prior to the codebook search, so that both impulses are taken into account during the search while keeping the complexity of the search at a low level.
[0140] Another approach to building the glottal-shape codevector with two glottal impulses in one subframe is to use an adaptive codebook search in a part of the subframe. The first T0 samples of the glottal-shape codevector qk'(n) are built using the glottal-shape codebook search, and the remaining samples in the subframe are then built using the adaptive search, as shown in Figure 14. This approach is more complex, but more accurate.
[0141] To further increase the encoding efficiency, the above described procedure can be used even if the second glottal impulse appears in one of the first L1/2 positions of the next subframe (Figure 15). In this situation, i.e. when k' and T0 satisfy N ≤ (k' + T0) < (N + L1/2), only a few samples (fewer than L1/2 + 1) of the glottal shape are used at the end of the current subframe. This approach is used in the non-restrictive illustrative embodiment. It has a limitation, however, because the pitch period value transmitted in these situations is limited to T0 < N (this is a question of effective encoding), although ideally its value should be limited to T0 ≤ N + L1/2. Therefore, if the second glottal impulse appears at the beginning of the next subframe, the repetition procedure cannot be used for some of the first L1/2 glottal impulse positions k' of the first glottal impulse.
Implementation of the TM Coding Technique in EV- VBR Codec
[0142] The TM coding technique according to the non-restrictive illustrative embodiment has been implemented in the EV-VBR codec. EV-VBR uses an internal sampling frequency of 12.8 kHz and a frame length of 20 ms. Each frame is divided into four subframes of N = 64 samples. The EV-VBR classification procedure has been adapted to select frames to be encoded using the TM coding technique. In this implementation, the gain of the glottal-shape codebook contribution is quantized in two steps as depicted in Figure 16, where G(z) is the shaping filter, k' is the position of the centre of the glottal shape and gm is a TM gain, i.e. a roughly quantized energy of the glottal-shape codevector. The TM gain gm is found in the same way as the pitch gain using Equation (4), only with the difference that it is not bounded. It is then quantized by means of a 3-bit scalar quantizer, and one bit is used for the sign. The glottal-shape codevector is then scaled using this gain gm. After both contributions to the filtered excitation signal (first and second stage contribution signals, i.e. the filtered glottal-shape codebook contribution and the filtered algebraic codebook contribution) are found, the gain of the first stage excitation signal is further adjusted jointly with the second stage excitation signal gain quantization, using the standard EV-VBR gain vector quantization (VQ). In this manner, the gain quantization codebooks of EV-VBR designed for generic or voiced coding modes can also be used in TM coding. Of course, it is within the scope of the present invention to perform the gain quantization using other, different methods.
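A minimal sketch of the first quantization step (Python with numpy); the actual 3-bit EV-VBR gain table is not given in the document, so `levels` is a placeholder:

import numpy as np

def quantize_tm_gain(gm, levels):
    # One sign bit plus a 3-bit index into the eight scalar levels.
    sign = 1.0 if gm >= 0 else -1.0
    idx = int(np.argmin(np.abs(np.asarray(levels, dtype=float) - abs(gm))))
    return sign, idx, sign * float(levels[idx])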
[0143] The search for the glottal impulse central position k' should theoretically be made over all positions in a subframe, i.e. within the range [0, N − 1]. Nevertheless, as already mentioned, this search is computationally intensive given the number of glottal shapes to be tried and, in practice, it can be done only in an interval of several samples around the position of the maximum absolute value in the LP residual signal. The searching interval can be set to ±4 samples around the position of the first glottal impulse maximum in the LP residual signal in the current frame. In this manner, the processing complexity is approximately the same as for the EV-VBR generic encoding using the adaptive and fixed codebook search.
[0144] The transmitted parameters related to the TM coding technique are listed in Table 4 with the corresponding numbers of bits. The parameter T0, which is used to determine the filter Q(z) or to perform the adaptive search for the second glottal impulse in the case of two impulses in one subframe, is transmitted when T0 ≤ N. The remaining parameters used for a TM frame, but common with the generic ACELP processing, are not shown here (frame identification bits, LP parameters, pitch delay for adaptive excitation, fixed codebook excitation, 1st and 2nd stage codebook gains). When TM parameters are added to the bit stream, the number of bits originally allocated to other EV-VBR parameters is reduced in order to maintain a constant bit rate. These bits can be taken, for example, from the fixed codebook excitation bits as well as from the gain quantization.
Table 4 - Parameters in the bit-stream transmitted for the subframe encoded using the TM.
(table reproduced as an image in the original publication)
[0145] The bit allocation tables used in EV-VBR are shown herein below. Let us recall that, when the glottal-shape codebook search is not applied to the first subframe, only the fixed codebook and its gain are transmitted to encode the excitation signal in the subframes preceding the glottal-shape codebook subframe. The same situation happens for configurations TRANSITION_1_3 and TRANSITION_1_4. In those cases it is possible to maintain the same or even a larger size of fixed codebook for all subframes as in the original generic ACELP coding.
TM Technique Performance in EV-VBR Codec
[0146] In this section some examples of the performance of the TM coding technique in the EV-VBR codec implementation are presented. In Figure 17 an example of the impact of the TM coding technique is shown for clean-channel conditions. Figure 17a shows the input speech signal, Figure 17b shows the LP residual signal and Figure 17c shows the first stage excitation signal, where the TM coding technique is used in the first three (3) frames. As expected, the difference between the residual signal and the first stage excitation signal is more pronounced at the beginning of each frame. Towards the end of the frame, the first stage excitation signal corresponds more closely to the residual signal, because the standard adaptive codebook search is used.
[0147] Tables 5 and 6 summarize some examples of the performance of the TM coding technique measured using SNR values.
[0148] In the first example (Table 5), the TM technique was implemented in a codec with a core (inner) sampling frequency Fs = 8 kHz (i.e. a subframe length N = 40 samples), a glottal-shape codebook with sixteen (16) shapes of length seventeen (17) samples was used, and narrowband input signals were tested. From Table 5 it can be seen that coding voiced onset frames using the TM coding technique enhances the quality of the output speech signal (see the segmental and weighted segmental SNR values for 1 and 2 TM frames). A further SNR increase can be observed if the voiced onset frame and one following frame are encoded using the TM coding technique. However, if more than one frame following the voiced onset frame is also coded using the TM coding technique, the SNR values decrease. The weighted SNR is the SNR weighted by the frame energy normalized by the frame length, in dB.
Table 5 - SNR measurements comparison of the impact of the TM coding technique on NB signals.
(table reproduced as an image in the original publication)
[0149] Table 6 summarizes an example of the performance of the EV-VBR codec with a core (inner) sampling frequency Fs = 12.8 kHz, a WB input speech signal and a glottal-shape codebook with eight (8) shapes of length seventeen (17) samples. Mostly because of the longer subframe length N, the SNR values show some degradation for the clean channel when the TM coding technique is used, even if it is used in one frame only. This is caused mostly by the limited length of the glottal-shape impulses. In comparison with the NB example, more zero values are present in the first stage excitation signal in the subframe. The benefit of using the TM coding technique in this example is in the FE (Frame Erasure) protection.
Table 6 - SNR measurements comparison of the impact of the TM coding technique on WB signals.
(table reproduced as an image in the original publication)
[0150] It should also be noted that, even when the TM coding technique is used in a frame after the erased frame, there is still a small difference between the synthesized speech in the clean channel and in the noisy channel. This is because the encoder and decoder internal states do not depend only on the past excitation signal, but also on many other parameters (e.g. filter memories, ISF (Immittance Spectral Frequencies) quantizer memories, ...). It is of course possible to test a variant in which a memoryless LP parameter quantization optimized for TM coding is used and all the internal states are reset for TM frames. In this way, all memories that the EV-VBR codec uses in the standard generic encoding mode were reset to ensure that the decoder internal states after a frame erasure are the same as its states in error-free conditions. Nevertheless, the speech quality in error-free conditions drops significantly for this variant. Consequently, there is a compromise to be made between the high performance in error-free conditions and the robustness to erased frames or packets when no additional memory resets are made.
[0151] Table 7 summarizes the computational complexity of the TM coding technique. In the worst case, the TM coding technique increases the complexity in the encoder by 1.8 WMOPS (Weighted Millions of Operations Per Second). The complexity in the decoder remains approximately the same.
Table 7 - Complexity of the TM coding technique (worst case and average values).
(table reproduced as an image in the original publication)
[0152] The following figures illustrate the performance of the TM coding technique for voiced onset frame modeling (Figures 18a-18c) and for frame error propagation mitigation (Figures 19a-19c). The TM coding technique is used in only one frame at a time in this example. Shown are a segment of the input speech signal (Figures 18a and 19a), the corresponding output synthesized speech signal processed by the EV-VBR decoder without the TM coding technique (Figures 18b and 19b), and the output synthesized speech signal processed using the standard EV-VBR decoder with the TM coding technique (Figures 18c and 19c). The benefits of the TM coding technique can be observed both in the modeling of the voiced onset frame (2nd frame of Figure 18) and in the limitation of frame error propagation (4th and 5th frames of Figure 19).
[0153] The frame erasure concealment technique used in the EV-VBR decoder is based on the use of an extra decoder delay of 20 ms (corresponding to one frame length). This means that, if a frame is missing, it is concealed with the knowledge of the future frame parameters. Let us suppose three (3) consecutive frames denoted m−1, m and m+1, and further suppose a situation where the frame m is missing. Then an interpolation of the last correctly received frame m−1 and the following correctly received frame m+1 can be computed in view of determining the codec parameters, including in particular but not exclusively the LP filter coefficients (represented by ISFs - Immittance Spectral Frequencies), the closed-loop pitch period T0, and the pitch and fixed codebook gains. The interpolation helps to estimate the lost frame parameters more accurately for stable voiced segments. However, it often fails for transition segments, where the codec parameters vary rapidly. To cope with this problem, the absolute value of the pitch period can be transmitted in every TM frame, even in the case where it is not used for the first stage excitation construction in the current frame m+1. This is valid especially for configurations TRANSITION_1_4 and TRANSITION_4.
[0154] Other parameters transmitted in a TM frame are the ISFs of the preceding frame. In CELP-type encoders, the ISF parameters are generally interpolated between the previous frame's ISFs and the current frame's ISFs for each subframe. This ensures a smooth evolution of the LP synthesis filter from one subframe to another. In case of a frame erasure, the ISFs of the frame preceding the erasure are usually used for the interpolation in the frame following the erasure, instead of the erased frame's ISFs. However, during transition segments the ISFs vary rapidly, and the last-good frame's ISFs might be very different from the ISFs of the missing, erased frame. Replacing the missing frame's ISFs by the ISFs of the previous frame may thus cause significant artefacts. If the past frame's ISFs can be transmitted, they can be used for the ISF interpolation in the TM frame in case the previous frame is erased. Later, different estimations of the LP coefficients used for the ISF interpolation when the frame preceding a TM frame is missing will be described.
[0155] The final implementation of the TM coding technique in the EV-VBR codec supposes that only one frame after the onset/transition frame is coded using TM. In this manner, about 6.3% of active speech frames are selected for TM encoding and decoding.
[0156] Another category of tests focused on the increase in encoding efficiency. The classification was made in a closed-loop search, where two variants - with and without the TM coding technique - were computed side by side in the encoder and the variant with the higher SNR was chosen as the output signal.
[0157] Results for the EV-VBR codec at a bit rate of 8 kbps are summarized in Table 8. In the WB case, 28% of active speech frames were classified for encoding using the TM coding technique and an increase of 0.203 dB in segmental SNR was achieved. In the NB case, 25% of active speech frames were classified for encoding using the TM coding technique and an increase of 0.300 dB in segmental SNR was achieved. Unfortunately, this objective test increase was not confirmed by subjective listening tests, which reported no preference between the codec with and without the TM coding technique. Although there is no speech quality degradation and the total number of TM frames is four (4) times higher compared with the open-loop classification, which results in much higher FE protection, this closed-loop classification and similar classifications are better not used in an EV-VBR codec implementation due to the increased complexity.

Table 8 - Segmental SNR and SNR measure comparison between the codec with and without the TM coding technique implemented when closed-loop classification is used.
(table reproduced as an image in the original publication)
Bit-allocation tables for TM coding technique in EV-VBR Codec
[0158] The TM coding technique was implemented in an EV-VBR codec candidate for ITU-T standardization. The following Table 9 shows the bit allocation tables of the original generic mode and of all TM coding mode configurations introduced herein above. These configurations are used in the EV-VBR codec.
Table 9 - Bit allocation tables for the generic coding mode and for all TM configurations as used in the EV-VBR codec (ID stands for configuration identification, ISFs for Immittance Spectral Frequencies, FCB for Fixed CodeBook, and subfr. for subframe).
(entries marked — and the configurations d) TRANSITION_1_3 and e) TRANSITION_1_4 are legible only in the images reproduced in the original publication)

a) GENERIC
# bits  parameter
2       coder type
1       NB/WB
36      ISFs
—       energy estimate
—       1st subfr. pitch
—       1st subfr. gains
—       2nd subfr. pitch
—       2nd subfr. gains
—       3rd subfr. pitch
—       3rd subfr. gains
—       4th subfr. pitch
—       4th subfr. gains
—       1st subfr. FCB
—       2nd subfr. FCB
—       3rd subfr. FCB
—       4th subfr. FCB
160     bits total

b) TRANSITION_1_1
# bits  parameter
2       coder type
1       NB/WB
36      ISFs
3       energy estimate
1       TM subfr. ID
5       1st subfr. pitch
3       TM shape
6       TM position
1       TM gain sign
3       TM gain value
5       1st subfr. gains
5       2nd subfr. pitch
5       2nd subfr. gains
5       3rd subfr. pitch
5       3rd subfr. gains
5       4th subfr. pitch
5       4th subfr. gains
20      1st subfr. FCB
20      2nd subfr. FCB
12      3rd subfr. FCB
12      4th subfr. FCB
160     bits total

c) TRANSITION_1_2
# bits  parameter
2       coder type
1       NB/WB
36      ISFs
3       energy estimate
1       TM subfr. ID
1       TM subfr. ID
3       TM shape
6       TM position
1       TM gain sign
3       TM gain value
5       1st subfr. gains
1       TM subfr. ID2
1       TM subfr. ID2
7       2nd subfr. pitch
5       2nd subfr. gains
5       3rd subfr. pitch
5       3rd subfr. gains
5       4th subfr. pitch
5       4th subfr. gains
20      1st subfr. FCB
20      2nd subfr. FCB
12      3rd subfr. FCB
12      4th subfr. FCB
160     bits total

d) TRANSITION_1_3 and e) TRANSITION_1_4
(reproduced only as images in the original publication)

f) TRANSITION_2
# bits  parameter
2       coder type
1       NB/WB
36      ISFs
3       energy estimate
1       TM subfr. ID
1       TM subfr. ID
1       TM subfr. ID
2       1st subfr. gain
3       TM shape
6       TM position
1       TM gain sign
3       TM gain value
5       2nd subfr. gains
8       3rd subfr. pitch
5       3rd subfr. gains
5       4th subfr. pitch
5       4th subfr. gains
20      1st subfr. FCB
20      2nd subfr. FCB
12      3rd subfr. FCB
20      4th subfr. FCB
160     bits total

g) TRANSITION_3
# bits  parameter
2       coder type
1       NB/WB
36      ISFs
3       energy estimate
1       TM subfr. ID
1       TM subfr. ID
1       TM subfr. ID
1       TM subfr. ID
3       1st subfr. gain
3       2nd subfr. gain
5       3rd subfr. pitch
3       TM shape
6       TM position
1       TM gain sign
3       TM gain value
5       3rd subfr. gains
8       4th subfr. pitch
5       4th subfr. gains
12      1st subfr. FCB
20      2nd subfr. FCB
20      3rd subfr. FCB
20      4th subfr. FCB
160     bits total

h) TRANSITION_4
# bits  parameter
2       coder type
1       NB/WB
36      ISFs
3       energy estimate
1       TM subfr. ID
1       TM subfr. ID
1       TM subfr. ID
1       TM subfr. ID
3       1st subfr. gain
2       2nd subfr. gain
3       3rd subfr. gain
8       4th subfr. pitch
3       TM shape
6       TM position
1       TM gain sign
3       TM gain value
5       4th subfr. gains
20      1st subfr. FCB
20      2nd subfr. FCB
20      3rd subfr. FCB
20      4th subfr. FCB
160     bits total

[0159] There is one exception to the configuration TRANSITION_2 in
Table 9. This bit-allocation table can be used only in the situation where it is decided to use the TM coding technique in the frames following the voiced onset frame only (the voiced onset frame is encoded using the generic coding mode and only one frame following the voiced onset frame is encoded using the TM coding technique). In this situation, the pitch period T0 satisfies T0 ≥ N in the second subframe and there is no need to transmit this parameter in the 2nd subframe. But if the TM coding technique is also used in the voiced onset frame, the following situation may occur: the pitch period is shorter than N, but the voiced onset starts only in the 2nd subframe (e.g. the first subframe still contains unvoiced signal). In this case the pitch period T0 must be transmitted. In this situation a different bit-allocation table is used, the parameter T0 is transmitted in the 2nd subframe using five (5) bits, and in one subframe a shorter fixed codebook is used (see Table 10). The same situation appears also for the configuration TRANSITION_3. However, in the present, non-limitative implementation, the pitch period is transmitted here anyway (whether the onset frame is coded using the TM coding technique or not), because there is no good use of the saved bits for another parameter encoding.
[0160] Other bit allocations can be used in different transition mode configurations. For instance, more bits can be allocated to the fixed codebooks in the subframes containing glottal pulses. For example, in TRANSITION_3 mode, an FCB with twelve (12) bits can be used in the second subframe and twenty-eight (28) bits in the third subframe. Of course, FCBs other than the 12- and 20-bit ones can be used in different coder implementations.
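By way of illustration only, a bit allocation such as the TRANSITION_3 one can be written out and checked programmatically; the sketch below (Python, with our own abbreviated labels, and the four 1-bit TM subframe ID flags grouped into a single entry) merely verifies that the allocation sums to the 160-bit frame budget.

```python
# Hypothetical transcription of the TRANSITION_3 allocation from Table 9.
TRANSITION_3_BITS = {
    "coder type": 2, "NB/WB": 1, "ISFs": 36, "energy estimate": 3,
    "TM subfr. ID": 4,                     # four 1-bit flags
    "1st subfr. gain": 3, "2nd subfr. gain": 3,
    "3rd subfr. pitch": 5, "TM shape": 3, "TM position": 6,
    "TM gain sign": 1, "TM gain value": 3,
    "3rd subfr. gains": 5, "4th subfr. pitch": 8, "4th subfr. gains": 5,
    "1st subfr. FCB": 12, "2nd subfr. FCB": 20,
    "3rd subfr. FCB": 20, "4th subfr. FCB": 20,
}

assert sum(TRANSITION_3_BITS.values()) == 160  # 160 bits total per frame
```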
Table 10 - Bit allocation table for configuration TRANSITION_2 if TM is used also in the onset frame.
TRANSITION_2a
# bits parameter
2 coder type
1 NB/WB
36 ISFs
3 energy estimate
1 TM subfr. ID
1 TM subfr. ID
1 TM subfr. ID
3 1st subfr. gain
5 2nd subfr. pitch
3 TM shape
6 TM position
1 TM gain sign
3 TM gain value
5 2nd subfr. gains
8 3rd subfr. pitch
5 3rd subfr. gains
5 4th subfr. pitch
5 4th subfr. gains
20 1st subfr. FCB
20 2nd subfr. FCB
12 3rd subfr. FCB
12 4th subfr. FCB
158 bits total

[0161] If there is available bandwidth, further enhancement can be achieved by transmitting more information for better frame erasure (FE) protection. The VMR-WB codec is an example of a codec that uses a portion of its bits for FE protection. For example, fourteen (14) protection bits per frame are used in the Generic Full-Rate encoding type in VMR-WB in Rate-Set II. These bits represent the frame classification (2 bits), the synthesized speech energy (6 bits) and the glottal pulse position (6 bits). The glottal pulse is inserted artificially at the decoder when a voiced onset frame is lost. These FER protection bits are of little importance for excitation construction in a TM frame, because the TM coding technique does not make use of the past excitation signal; it constructs the excitation signal using parameters transmitted in the current (TM) frame. These bits can, however, be employed for the transmission of other parameters. In an example of implementation, they can be used to transmit in the current TM frame the ISF parameters of the previous frame (although only twelve (12) bits instead of thirty-six (36) bits are available). These ISFs are used for more precise reconstruction of the LP filter coefficients in case of frame erasure.
[0162] In the EV-VBR codec the set of LP parameters is computed centered on the fourth subframe, whereas the first, second, and third subframes use a linear interpolation of the LP filter parameters between the current and the previous frame. The interpolation is performed on the ISPs (Immittance Spectral Pairs). Let $q_4^{(m)}$ be the ISP vector at the 4th subframe of the current frame m, and $q_4^{(m-1)}$ the ISP vector at the 4th subframe of the past frame m-1. The interpolated ISP vectors at the 1st, 2nd, and 3rd subframes are given by the Equations:
$$q_1^{(m)} = 0.55\,q_4^{(m-1)} + 0.45\,q_4^{(m)}$$
$$q_2^{(m)} = 0.2\,q_4^{(m-1)} + 0.8\,q_4^{(m)} \qquad (35)$$
$$q_3^{(m)} = 0.04\,q_4^{(m-1)} + 0.96\,q_4^{(m)}$$
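As a non-normative illustration, the interpolation of Equations (35) can be sketched as follows; the function name and the use of floating-point NumPy vectors are our assumptions, not the codec's fixed-point implementation.

```python
import numpy as np

def interpolate_isps(q4_prev: np.ndarray, q4_curr: np.ndarray):
    """Interpolate the ISP vectors for subframes 1-3 per Equations (35).

    q4_prev -- ISP vector of the 4th subframe of the past frame (m-1)
    q4_curr -- ISP vector of the 4th subframe of the current frame (m)
    """
    q1 = 0.55 * q4_prev + 0.45 * q4_curr
    q2 = 0.20 * q4_prev + 0.80 * q4_curr
    q3 = 0.04 * q4_prev + 0.96 * q4_curr
    return q1, q2, q3
```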
[0163] This interpolation is, however, not directly suited for the TM coding technique in the case of erasure of the previous frame. When the frame preceding the TM frame is missing, it can be supposed that the last correctly received frame is unvoiced. It is more efficient in this situation to reconstruct the ISP vector for the missing frame with different interpolation constants, regardless of whether some ISF information from the FER protection bits is available or not. In general, the interpolation then uses the previous frame ISPs more heavily. The ISP vectors for the missing frame m can be given at the decoder, for example, by the following Equations:
[Equations (36), giving the ISP vectors for the missing frame m, were lost in extraction; per the surrounding text, their interpolation constants weight the previous frame ISPs more heavily than the coefficients of Equations (35).]
[0164] The following correctly received TM frame m+1 then uses the LP coefficient interpolation described by Equations (35). The interpolation coefficients in Equations (36) are also given as a non-limitative example. The final coefficients could be different; additionally, it is desirable to use one set of interpolation coefficients when some ISF information from the previous frame is available and another set when ISF information from the previous frame is not available (i.e. when there are no frame erasure protection bits in the bit stream).

Pitch Period and Gain Encoding in TM Frames in EV-VBR Codec
[0165] The value of the pitch period T0 is transmitted for every subframe in the generic encoding mode used in the EV-VBR codec. In the 1st and 3rd subframes, an 8-bit encoding is used and the pitch period value is transmitted with fractional (1/2-sample, for T0 in the range [Tmin, 91 1/2]) or integer (for T0 in the range [92, Tmax]) resolution. In the 2nd and 4th subframes, a delta search is used and the pitch period value, always with fractional resolution, is coded with five (5) bits. Delta search means a search within the range [T0p - 8, T0p + 7 1/2], where T0p is the nearest integer to the fractional pitch period of the previous (1st or 3rd) subframe. The values of the pitch period are limited in the EV-VBR codec to the range [Tmin, Tmax], where Tmin = 34 and Tmax = 231.
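For illustration, the 5-bit delta encoding described above can be sketched as follows; the index layout (32 half-sample steps over [T0p - 8, T0p + 7 1/2]) follows the description, while the function names and the error handling are our assumptions.

```python
def encode_delta_pitch(t0_frac: float, t0p: int) -> int:
    """Map a fractional pitch period to a 5-bit delta index.

    The 32 codes cover [t0p - 8, t0p + 7.5] in half-sample steps,
    where t0p is the nearest integer to the pitch period of the
    previous (1st or 3rd) subframe.
    """
    index = round((t0_frac - (t0p - 8)) * 2)  # 0 .. 31
    if not 0 <= index <= 31:
        raise ValueError("pitch period outside the delta-search range")
    return index

def decode_delta_pitch(index: int, t0p: int) -> float:
    """Inverse mapping: recover the fractional pitch period."""
    return (t0p - 8) + index / 2.0
```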
[0166] The pitch gain gp and the fixed codebook gain gc are encoded in the EV-VBR codec in principle in the same manner as in the AMR-WB+ codec [5]. First, an estimate of a non-predictive scaled fixed codebook energy is calculated for all subframes in a frame and quantized with three (3) bits once per frame (see the parameter energy estimate in Table 9). Then the pitch gain gp and the fixed codebook gain gc are vector quantized and coded in one step using five (5) bits for every subframe.
[0167] The estimated fixed codebook energy is computed and quantized as follows. First, the LP residual energy is computed in each subframe k using the following Equation:
$$E_r(k) = 10\log_{10}\!\left(\frac{1}{N}\sum_{n=0}^{N-1}u^2(kN+n)\right) \qquad (37)$$
where u(n) is the LP residual signal. Then the average residual energy per subframe is found through the following Equation:
$$\bar{E}_r = \frac{1}{4}\sum_{k=0}^{3}E_r(k) \qquad (38)$$
[0168] The fixed codebook energy is estimated from the residual energy by removing an estimate of the adaptive codebook contribution. This is done by removing an energy related to the average normalized correlation obtained from the two open-loop pitch analyses performed in the frame. The following Equation is used:
$$E_s = \bar{E}_r - 10\bar{R} \qquad (39)$$
where R̄ is the average of the normalized pitch correlations obtained from the open-loop pitch analysis for each half-frame of the current frame. The estimated scaled fixed codebook energy is not dependent on the previous frame energy, and the gain encoding principle is thus robust to frame erasures.
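A compact sketch of this energy estimation (Equations (37) to (39)) follows; splitting the frame into four equal subframes and the function signature are our assumptions for illustration.

```python
import numpy as np

def estimate_fcb_energy(residual: np.ndarray, r_bar: float) -> float:
    """Estimate the scaled fixed codebook energy in dB, Equations (37)-(39).

    residual -- LP residual u(n) for the whole frame (four subframes)
    r_bar    -- average normalized open-loop pitch correlation of the frame
    """
    N = len(residual) // 4                      # subframe length
    e_r = [10 * np.log10(np.mean(residual[k * N:(k + 1) * N] ** 2))
           for k in range(4)]                   # Eq. (37), per subframe
    e_bar = float(np.mean(e_r))                 # Eq. (38), frame average
    return e_bar - 10.0 * r_bar                 # Eq. (39)
```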
[0169] Once the estimate of the fixed codebook energy is found, the pitch gain and the fixed codebook gain correction are computed: the estimated scaled fixed codebook energy is used to calculate the estimated fixed codebook gain and the correction factor γ (the ratio between the true and the estimated fixed codebook gains). The value γ is vector quantized together with the pitch gain using five (5) bits per subframe. For the design of the quantizer, a modified k-means method [4] is used. The pitch gain is restricted to the interval <0; 1.2> during the codebook initialization and <0; ∞> during the iterative codebook improvement. Likewise, the correction factor γ is limited to <0; 5> during initialization and <0; ∞> during the codebook improvement. The modified k-means algorithm seeks to minimize the following criterion:
$$E = \mathbf{x}^T\mathbf{x} + g_p^2\,\mathbf{y}_1^T\mathbf{y}_1 + g_c^2\,\mathbf{y}_2^T\mathbf{y}_2 - 2g_p\,\mathbf{x}^T\mathbf{y}_1 - 2g_c\,\mathbf{x}^T\mathbf{y}_2 + 2g_pg_c\,\mathbf{y}_1^T\mathbf{y}_2 \qquad (40)$$

where x is the target signal, y1 is the filtered adaptive (or glottal-shape) excitation and y2 is the filtered fixed codebook excitation.

[0170] When using the TM coding technique, transmission of the pitch period and of both the pitch and fixed codebook gains may not be required for subframes in which there is no important glottal impulse; in those subframes only the fixed codebook contribution may be computed.
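As a sketch, the search minimizing Equation (40) can be written directly on the residual form of the criterion; the codebook entry format (gp, γ) and the function name are our assumptions, consistent with the description above.

```python
import numpy as np

def gain_vq_search(x, y1, y2, g_c_est, codebook):
    """Pick the (g_p, gamma) entry minimizing the error of Equation (40).

    x        -- target signal
    y1       -- filtered adaptive (or glottal-shape) excitation
    y2       -- filtered fixed codebook excitation
    g_c_est  -- estimated fixed codebook gain derived from Es
    codebook -- iterable of (g_p, gamma) pairs (32 entries for 5 bits)
    """
    best_index, best_err = -1, np.inf
    for i, (g_p, gamma) in enumerate(codebook):
        g_c = gamma * g_c_est                  # true gain = correction * estimate
        err = float(np.sum((x - g_p * y1 - g_c * y2) ** 2))  # expands to Eq. (40)
        if err < best_err:
            best_index, best_err = i, err
    return best_index
```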
[0171] The following is a list and description of all TM configurations:
[0172] Configuration TRANSITION_1_1 (Figure 20) - In this configuration, one or two first glottal impulses appear in the first subframe, which is processed using the glottal-shape codebook search. This means that the pitch period value in the first subframe has a maximum value less than the subframe length, i.e. Tmin < T0 < N. With integer resolution it can be coded with five (5) bits. The pitch periods in the next subframes are found using a 5-bit delta search with fractional resolution.
This is the most bit-demanding configuration of the TM coding technique, i.e. the glottal-shape codebook is used in the first subframe and the pitch period T0 is transmitted for determination of the Q(z) filter, or for the adaptive codebook search in the part of the first subframe. This configuration uses in the first subframe the procedure described above. It is used in the EV-VBR codec also when only one glottal impulse appears in the first subframe. Here the pitch period satisfies T0 < N and it is used for periodicity enhancement [1] in the fixed codebook search.
[0173] Configuration TRANSITION_1_2 (Figure 21) - When the configuration TRANSITION_1_2 is used, the first subframe is processed using the glottal-shape codebook search. The pitch period is not needed there, and all following subframes are processed using the adaptive codebook search. Because the second subframe is known to contain the second glottal impulse, the pitch period maximum value satisfies T0 ≤ 2N - 1. This maximum value can be further reduced thanks to the knowledge of the glottal impulse position k'. The pitch period value in the second subframe is then coded using seven (7) bits with fractional resolution over the whole range. In the third and fourth subframes, a delta search using five (5) bits is used with fractional resolution.
[0174] Configuration TRANSITION_1_3 (Figure 22) - When the configuration TRANSITION_1_3 is used, the first subframe is processed using the glottal-shape codebook search, again without use of the pitch period. Because the second subframe of the LP residual signal contains no glottal impulse, the adaptive search is useless there and the first-stage excitation signal is replaced by zeros in the second subframe. The adaptive codebook parameters (T0 and gp) are not transmitted in the second subframe, and the saved bits are used to increase the FCB size in the third subframe. Because the second subframe contains a minimum of useful information, only the 12-bit FCB is used there, while the 20-bit FCB is used in the fourth subframe. The first-stage excitation signal in the third subframe is constructed using the adaptive codebook search with the pitch period maximum value (3N - 1 - k') and minimum value (2N - k'); thus a 7-bit encoding of the pitch period with fractional resolution over the whole range suffices. The fourth subframe is processed using the adaptive search again, with a 5-bit delta search encoding of the pitch period value.
In the second subframe only the fixed codebook gain gc is transmitted. Consequently, only two (2) or three (3) bits are needed for gain quantization, instead of the 5-bit quantizer used in subframes with traditional ACELP encoding (i.e. when both gains gp and gc are transmitted). This holds also for all the following configurations. The decision as to whether the gain quantizer should use two (2) or three (3) bits is made to fit the number of bits available in the frame.
[0175] Configuration TRANSITION_1_4 (Figure 23) - When the configuration TRANSITION_1_4 is used, the first subframe is processed using the glottal-shape codebook search. Again, the pitch period does not need to be transmitted. But because the LP residual signal contains no glottal impulse in the second and also in the third subframe, the adaptive codebook search is useless for these two subframes. Again, the first-stage excitation signal in these subframes is replaced by zeros and the saved bits are used to increase the FCB sizes, so that all subframes can use the 20-bit FCBs. The pitch period value is transmitted only in the fourth subframe and its minimum value is (3N - k'). The maximum value of the pitch period is limited by Tmax. It does not matter whether the second glottal impulse appears in the fourth subframe or not (the second glottal impulse can be present in the next frame if k' + Tmax ≥ 4N).
The absolute value of the pitch period is used at the decoder for frame concealment; therefore this absolute value is transmitted even in the situation where the second glottal impulse appears only in the next frame. When a frame m preceding the TM frame m+1 is missing, the correct knowledge of the pitch period values from the frames m-1 and m+1 helps to successfully reconstruct the missing part of the synthesis signal in the frame m.
[0176] Configuration TRANSITION_2 (Figure 24) - When the first glottal impulse appears in the second subframe and only frames after voiced onset frames are encoded using the TM coding technique (i.e. the voiced onset frames are encoded with the legacy generic encoding), the pitch period is transmitted only in the third and fourth subframes. In this case, only fixed codebook parameters are transmitted in the first subframe.
The frame shown in Figure 24 assumes the configuration in which TM is not used in voiced onset frames. If TM is used also in the voiced onset frames, the configuration TRANSITION_2a is used, in which the pitch period T0 is transmitted in the second subframe using the procedure described above.
[0177] Configuration TRANSITION_3 (Figure 25) - When the first glottal impulse appears in the third subframe and only frames after the voiced onset frames are encoded using the TM coding technique (i.e. the voiced onset frames are coded with the legacy generic encoding), the pitch period is transmitted only in the fourth subframe. In this case only fixed codebook parameters are transmitted in the first and second subframes.
The pitch period is still transmitted for the third subframe in the bit stream. However, it is useful only when the voiced onset frames themselves are encoded using the TM coding technique; otherwise it serves no purpose.
[0178] Configuration TRANSITION_4 (Figure 26) - When the first glottal impulse appears in the fourth subframe and only frames after voiced onset frames are encoded using the TM coding technique (i.e. the voiced onset frames are encoded with the legacy generic encoding), the pitch period value is not used for excitation construction in this subframe. However, the pitch period value is used in the frame concealment at the decoder (this value is used for the missing frame reconstruction when the frame preceding the TM frame is missing). Thus the pitch value is transmitted only in the fourth subframe, and only fixed codebook parameters are transmitted in the first, second and third subframes (the pitch gain gp is not required). The saved bits allow the 20-bit FCB to be used in every subframe.
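The configuration choices described above can be summarized in a sketch such as the following; the mapping is a paraphrase of the descriptions in this section, and the function and argument names are ours.

```python
from typing import Optional

def select_tm_configuration(first_impulse_subfr: int,
                            second_impulse_subfr: Optional[int],
                            onset_frame_uses_tm: bool) -> str:
    """Choose a TM configuration from the glottal impulse layout.

    first_impulse_subfr  -- 0-based subframe index of the first glottal impulse
    second_impulse_subfr -- subframe of the second impulse, or None when it
                            falls only in the next frame
    onset_frame_uses_tm  -- True when the voiced onset frame itself is TM-coded
    """
    if first_impulse_subfr == 0:
        if second_impulse_subfr == 0:
            return "TRANSITION_1_1"   # both impulses in the first subframe
        if second_impulse_subfr == 1:
            return "TRANSITION_1_2"
        if second_impulse_subfr == 2:
            return "TRANSITION_1_3"
        return "TRANSITION_1_4"       # 2nd impulse in the 4th subframe or later
    if first_impulse_subfr == 1:
        return "TRANSITION_2a" if onset_frame_uses_tm else "TRANSITION_2"
    return "TRANSITION_3" if first_impulse_subfr == 2 else "TRANSITION_4"
```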
[0179] Although the present invention has been described in the foregoing description in connection with a non-restrictive illustrative embodiment thereof, this non-restrictive illustrative embodiment can be modified at will, within the scope of the appended claims, without departing from the scope and spirit of the present invention.
References
[1] B. BESSETTE, R. SALAMI, R. LEFEBVRE, M. JELINEK, J. ROTOLA-PUKKILA, J. VAINIO, H. MIKKOLA, and K. JARVINEN, "The Adaptive Multi-Rate Wideband Speech Codec (AMR-WB)", Special Issue of IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 8, pp. 620-636, November 2002.
[2] R. SALAMI, C. LAFLAMME, J-P. ADOUL, and D. MASSALOUX, "A toll quality 8 kb/s speech codec for the personal communications system (PCS)", IEEE Trans. on Vehicular Technology, Vol. 43, No. 3, pp. 808-816, August 1994.
[3] 3GPP2 Tech. Spec. C.S0052-A v1.0, "Source-Controlled Variable-Rate Multimode Wideband Speech Codec (VMR-WB), Service Options 62 and 63 for Spread Spectrum Systems," April 2005; http://www.3gpp2.org
[4] S. P. LLOYD, "Least squares quantization in PCM," IEEE Transactions on Information Theory, Vol. 28, No. 2, pp. 129-136, March 1982.
[5] 3GPP Tech. Spec. 26.290, "Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions," June 2005.
[6] "Extended high-level description of the Q9 EV-VBR baseline codec," ITU-T SG16 Tech. Cont. COM16-C199R1-E, June 2007.

Claims

WHAT IS CLAIMED IS:
1. A transition mode device for use in a predictive-type sound signal codec for producing a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in the sound signal, comprising: an input for receiving a codebook index; and a transition mode codebook for generating a set of codevectors independent from past excitation, the transition mode codebook being responsive to the index for generating, in the transition frame and/or frame following the transition, one of the codevectors of the set corresponding to said transition mode excitation.
2. A transition mode device as defined in claim 1, wherein the transition mode codebook comprises a fixed codebook independent from past excitation.
3. A transition mode device as defined in claim 1, wherein the predictive-type sound signal codec comprises a decoder whereby, in operation, replacing the adaptive codebook excitation by the transition mode excitation in the transition frame and/or the frame following the transition reduces error propagation at the decoder in case of frame erasure and/or increases coding efficiency.
4. A transition mode device as defined in claim 1, wherein the transition mode codebook comprises a codebook of glottal impulse shapes.
5. A transition mode device as defined in claim 1, wherein the sound signal comprises a speech signal and wherein the transition frame is selected from the group consisting of a frame comprising a voiced onset and a frame comprising a transition between two different voiced sounds.
6. A transition mode device as defined in claim 1, wherein the transition frame and/or the frame following the transition comprise a transition frame followed by several frames.
7. A transition mode device as defined in claim 6, wherein the transition frame and the several frames following the transition frame are consecutive frames.
8. A transition mode device as defined in claim 1, wherein the transition frame and/or the frame following the transition comprises at least one frame following the transition.
9. A transition mode device as defined in claim 1, wherein the predictive-type codec is a CELP-type codec and wherein the transition mode codebook replaces the adaptive codebook of the CELP-type codec in the transition frame and/or the frame following the transition.
10. A transition mode device as defined in claim 1, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and wherein the transition mode codebook is used in a first part of the subframes and a predictive-type codebook of the predictive-type codec is used in a second part of the subframes.
11. A transition mode device as defined in claim 1, wherein the codebook comprises a glottal-shape codebook comprising codevectors formed of a glottal impulse shape placed at a specific position in the codevector.
12. A transition mode device as defined in claim 11, wherein the glottal-shape codebook includes a predetermined number of different shapes of glottal impulses, and wherein each shape of glottal impulse is positioned at a plurality of different positions in the codevector to form a plurality of different codevectors of the glottal-shape codebook.
13. A transition mode device as defined in claim 11, wherein the glottal-shape codebook comprises a generator of codevectors containing only one non-zero element and a shaping filter for processing the codevectors containing only one non-zero element to produce codevectors representing glottal impulse shapes centered at different positions.
14. A transition mode device as defined in claim 13, wherein the predictive-type sound signal codec comprises an encoder comprising a weighted synthesis filter for processing the codevectors from the shaping filter representing glottal impulse shapes centered at different positions.
15. A transition mode device as defined in claim 13, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, the glottal-shape codebook further comprises a repetition filter positioned downstream of the shaping filter for repeating, when there are more than one glottal impulse per subframe, the glottal impulse shape after a pitch period has elapsed.
16. A transition mode device as defined in claim 11, wherein the glottal-shape impulses comprise first and last samples, and wherein a predetermined number of the first and last samples are truncated.
17. A transition mode device as defined in claim 13, further comprising an amplifier for applying a gain to the codevectors representing glottal impulse shapes centered at different positions.
18. An encoder device for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: a generator of a codebook search target signal; a transition mode codebook for generating a set of codevectors independent from past excitation, the codevectors of said set each corresponding to a respective transition mode excitation; a searcher of the transition mode codebook for finding the codevector of said set corresponding to a transition mode excitation optimally corresponding to the codebook search target signal.
19. An encoder device as defined in claim 18, wherein the transition mode codebook comprises a fixed codebook independent from past excitation.
20. An encoder device as defined in claim 18, wherein the transition mode codebook comprises a codebook of glottal impulse shapes.
21. An encoder device as defined in claim 20, wherein the searcher applies a given criterion to every glottal impulse shape of the codebook of glottal impulse shapes and finds as the codevector optimally corresponding to the adaptive codebook search target signal the codevector of the set corresponding to a maximum value of said criterion.
22. An encoder device as defined in claim 21, wherein the searcher identifies the found codevector by means of transition mode parameters selected from the group consisting of a transition mode configuration identification, a glottal impulse shape, a position of the glottal impulse shape centre in the found codevector, a transition mode gain, a sign of the transition mode gain and a closed-loop pitch period.
23. An encoder device as defined in claim 18, wherein the sound signal comprises a speech signal and wherein the transition frame is selected from the group consisting of a frame comprising a voiced onset and a frame comprising a transition between two different voiced sounds.
24. An encoder device as defined in claim 18, wherein the transition frame and/or the frame following the transition comprise a transition frame followed by several frames.
25. An encoder device as defined in claim 24, wherein the transition frame and the several frames following the transition frame are consecutive frames.
26. An encoder device as defined in claim 18, wherein the transition frame and/or the frame following the transition comprises at least one frame following the transition.
27. An encoder device as defined in claim 18, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and wherein the searcher searches the transition mode codebook in a first part of the subframes and a predictive-type codebook of the encoder device in a second part of the subframes.
28. An encoder device as defined in claim 18, wherein the transition mode codebook comprises a glottal-shape codebook comprising codevectors formed of a glottal impulse shape placed at a specific position in the codevector.
29. An encoder device as defined in claim 28, wherein the glottal-shape codebook includes a predetermined number of different shapes of glottal impulses, and wherein each shape of glottal impulse is positioned at a plurality of different positions in the codevector to form a plurality of different codevectors of the glottal-shape codebook.
30. An encoder device as defined in claim 28, wherein the glottal-shape codebook comprises a generator of codevectors containing only one non-zero element and a shaping filter for processing the codevectors containing only one non-zero element to produce codevectors representing glottal impulse shapes centered at different positions.
31. An encoder device as defined in claim 30, comprising a weighted synthesis filter for processing the codevectors from the shaping filter representing glottal impulse shapes centered at different positions.
32. An encoder device as defined in claim 30, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, the glottal-shape codebook further comprises a repetition filter positioned downstream of the shaping filter for repeating, when there are more than one glottal impulse per subframe, the glottal impulse shape after a pitch period has elapsed.
33. An encoder device as defined in claim 28, wherein the glottal-shape impulses comprise first and last samples, and wherein a predetermined number of the first and last samples are truncated.
34. An encoder device as defined in claim 31, further comprising an amplifier for applying a gain to the codevectors representing glottal impulse shapes centered at different positions.
35. An encoder device as defined in claim 18, further comprising: a generator of an innovation codebook search target signal; an innovation codebook for generating a set of innovation codevectors each corresponding to a respective innovation excitation; a searcher of the innovation codebook for finding the innovation codevector of said set corresponding to an innovation excitation optimally corresponding to the innovation codebook search target signal; and an adder of the transition mode excitation and the innovation excitation to produce a global excitation for a sound signal synthesis filter.
36. An encoder device as defined in claim 35, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes and wherein, depending on where the glottal impulse or impulses are located in the subframes, the encoder device comprises means for encoding the subframes using at least one of the transition mode codebook, the adaptive codebook and the innovation codebook.
37. A decoder device for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: an input for receiving a codebook index; a transition mode codebook for generating a set of codevectors independent from past excitation, the transition mode codebook being responsive to the index for generating in the transition frame and/or frame following the transition one of the codevectors of the set corresponding to the transition mode excitation.
38. A decoder device as defined in claim 37, wherein the transition mode codebook comprises a fixed codebook independent from past excitation.
39. A decoder device as defined in claim 37, wherein replacing the adaptive codebook excitation by the transition mode excitation in the transition frame and/or the frame following the transition reduces error propagation at the decoder device in case of frame erasure and/or improves coding efficiency.
40. A decoder device as defined in claim 37, wherein the transition mode codebook comprises a codebook of glottal impulse shapes.
41. A decoder device as defined in claim 37, wherein the sound signal comprises a speech signal and wherein the transition frame is selected from the group consisting of a frame comprising a voiced onset and a frame comprising a transition between two different voiced sounds.
42. A decoder device as defined in claim 37, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and wherein the transition mode codebook is used in a first part of the subframes and the decoder device comprises a predictive-type codebook that is used in a second part of the subframes.
43. A decoder device as defined in claim 37, wherein the transition mode codebook comprises a glottal-shape codebook comprising codevectors formed of a glottal impulse shape placed at a specific position in the codevector.
44. A decoder device as defined in claim 43, wherein the glottal-shape codebook includes a predetermined number of different shapes of glottal impulses, and wherein said method comprises forming a plurality of different codevectors of the glottal-shape codebook by positioning each shape of glottal impulse at a plurality of different positions in the codevector.
45. A decoder device as defined in claim 43, wherein the glottal-shape codebook comprises a generator of codevectors containing only one non-zero element and a shaping filter for processing the codevectors containing only one non-zero element to produce codevectors representing glottal impulse shapes centered at different positions.
46. A decoder device as defined in claim 45, further comprising an amplifier for applying a gain to the codevectors representing glottal impulse shapes centered at different positions.
47. A decoder device as defined in claim 37, further comprising: an input for receiving an innovation codebook index; an innovation codebook for generating a set of innovation codevectors, the innovation codebook being responsive to the innovation codebook index for generating in the transition frame and/or frame following the transition one of the innovation codevectors of the set corresponding to an innovation excitation; an adder of the transition mode excitation and the innovation excitation to produce a global excitation for a sound signal synthesis filter.
48. A transition mode method for use in a predictive-type sound signal codec for producing a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in the sound signal, comprising: providing a transition mode codebook for generating a set of codevectors independent from past excitation; supplying a codebook index to the transition mode codebook; and generating, by means of the transition mode codebook and in response to the codebook index, one of the codevectors of the set corresponding to said transition mode excitation.
49. A transition mode method as defined in claim 48, wherein the transition mode codebook comprises a fixed codebook independent from past excitation.
50. A transition mode method as defined in claim 48, wherein the predictive- type sound signal codec comprises a decoder whereby, in operation, replacing the adaptive codebook excitation by the transition mode excitation in the transition frame and/or the frame following the transition reduces error propagation at the decoder in case of frame erasure and/or increase coding efficiency.
51. A transition mode method as defined in claim 48, wherein the transition mode codebook comprises a codebook of glottal impulse shapes.
52. A transition mode method as defined in claim 48, wherein the sound signal comprises a speech signal and said method comprises selecting the transition frame from the group consisting of a frame comprising a voiced onset and a frame comprising a transition between two different voiced sounds.
53. A transition mode method as defined in claim 48, wherein the transition frame and/or the frame following the transition comprise a transition frame followed by several frames.
54. A transition mode method as defined in claim 53, wherein the transition frame and the several frames following the transition frame are consecutive frames.
55. A transition mode method as defined in claim 48, wherein the transition frame and/or the frame following the transition comprises at least one frame following the transition.
56. A transition mode method as defined in claim 48, wherein the predictive-type codec is a CELP-type codec and said method comprises replacing the adaptive codebook of the CELP-type codec by the transition mode codebook in the transition frame and/or the frame following the transition.
57. A transition mode method as defined in claim 48, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and said method comprises using the transition mode codebook in a first part of the subframes and a predictive-type codebook of the predictive-type codec in a second part of the subframes.
58. A transition mode method as defined in claim 48, wherein providing a transition mode codebook comprises providing a glottal-shape codebook comprising codevectors formed of a glottal impulse shape placed at a specific position in the codevector.
59. A transition mode method as defined in claim 58, wherein providing a glottal-shape codebook comprises providing a glottal-shape codebook including a predetermined number of different shapes of glottal impulses and forming in the glottal-shape codebook a plurality of different codevectors by positioning each shape of glottal impulse at a plurality of different positions in the codevector.
60. A transition mode method as defined in claim 58, comprising, in the glottal-shape codebook, generating codevectors containing only one non-zero element and processing through a shaping filter the codevectors containing only one non-zero element to produce codevectors representing glottal impulse shapes centered at different positions.
61. A transition mode method as defined in claim 60, wherein the predictive-type sound signal codec comprises an encoder comprising a weighted synthesis filter, said method further comprising processing the codevectors from the shaping filter representing glottal impulse shapes centered at different positions through the weighted synthesis filter.
62. A transition mode method as defined in claim 60, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and wherein generating one of the codevectors comprises repeating, when there are more than one glottal impulse per subframe, the glottal impulse shape after a pitch period has elapsed.
63. A transition mode method as defined in claim 58, wherein the glottal-shape impulses comprise first and last samples, said method comprising truncating a predetermined number of the first and last samples.
64. A transition mode method as defined in claim 60, further comprising applying a gain to the codevectors representing glottal impulse shapes centered at different positions.
65. An encoding method for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: generating a codebook search target signal; providing a transition mode codebook for generating a set of codevectors independent from past excitation, the codevectors of said set each corresponding to a respective transition mode excitation; searching the transition mode codebook for finding the codevector of said set corresponding to a transition mode excitation optimally corresponding to the codebook search target signal.
66. An encoding method as defined in claim 65, wherein providing a transition mode codebook comprises providing a fixed codebook independent from past excitation.
67. An encoding method as defined in claim 65, wherein providing a transition mode codebook comprises providing a codebook of glottal impulse shapes.
68. An encoding method as defined in claim 67, wherein searching the transition mode codebook comprises applying a given criterion to every glottal impulse shape of the codebook of glottal impulse shapes and finding, as the codevector optimally corresponding to the adaptive codebook search target signal, the codevector of the set corresponding to a maximum value of said criterion.
69. An encoding method as defined in claim 68, wherein searching the transition mode codebook comprises identifying the found codevector by means of transition mode parameters selected from the group consisting of a transition mode configuration identification, a glottal impulse shape, a position of the glottal impulse shape centre in the found codevector, a transition mode gain, a sign of the transition mode gain and a closed-loop pitch period.
70. An encoding method as defined in claim 65, wherein the sound signal comprises a speech signal and said method further comprises selecting the transition frame from the group consisting of a frame comprising a voiced onset and a frame comprising a transition between two different voiced sounds.
71. An encoding method as defined in claim 65, wherein the transition frame and/or the frame following the transition comprise a transition frame followed by several frames.
72. An encoding method as defined in claim 71, wherein the transition frame and the several frames following the transition frame are consecutive frames.
73. An encoding method as defined in claim 65, wherein the transition frame and/or the frame following the transition comprises at least one frame following the transition.
74. An encoding method as defined in claim 65, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and wherein searching the transition mode codebook comprises searching the transition mode codebook in a first part of the subframes and searching a predictive-type codebook of the encoder device in a second part of the subframes.
75. An encoding method as defined in claim 65, wherein providing a transition mode codebook comprises providing a glottal-shape codebook comprising codevectors formed of a glottal impulse shape placed at a specific position in the codevector.
76. An encoding method as defined in claim 75, wherein providing a glottal-shape codebook comprises providing a glottal-shape codebook including a predetermined number of different shapes of glottal impulses, and forming a plurality of different codevectors of the glottal-shape codebook by positioning each shape of glottal impulse at a plurality of different positions in the codevector.
77. An encoding method as defined in claim 75, wherein generating in the glottal-shape codebook a set of codevectors independent from past excitation comprises generating codevectors containing only one non-zero element and processing through a shaping filter the codevectors containing only one non-zero element to produce codevectors representing glottal impulse shapes centered at different positions.
78. An encoding method as defined in claim 77, comprising processing through a weighted synthesis filter the codevectors from the shaping filter representing glottal impulse shapes centered at different positions.
79. An encoding method as defined in claim 77, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and said method further comprises repeating, when there are more than one glottal impulse per subframe, the glottal impulse shape after a pitch period has elapsed.
80. An encoding method as defined in claim 75, wherein the glottal-shape impulses comprise first and last samples, said method comprising truncating a predetermined number of the first and last samples.
81. An encoding method as defined in claim 78, further comprising applying a gain to the codevectors representing glottal impulse shapes centered at different positions.
82. An encoding method as defined in claim 65, further comprising: generating an innovation codebook search target signal; providing an innovation codebook for generating a set of innovation codevectors each corresponding to a respective innovation excitation; searching the innovation codebook for finding the innovation codevector of said set corresponding to an innovation excitation optimally corresponding to the innovation codebook search target signal; and adding the transition mode excitation and the innovation excitation to produce a global excitation for a sound signal synthesis filter.
83. An encoding method as defined in claim 82, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes and wherein, depending on where the glottal impulse or impulses are located in the subframes, the encoding method comprises encoding the subframes using at least one of the transition mode codebook, the adaptive codebook and the innovation codebook.
84. A decoding method for generating a transition mode excitation replacing an adaptive codebook excitation in a transition frame and/or a frame following the transition in a sound signal, comprising: receiving a codebook index; supplying the codebook index to a transition mode codebook for generating a set of codevectors independent from past excitation; and generating, by means of the transition mode codebook and in response to the codebook index, one of the codevectors of the set corresponding to the transition mode excitation.
85. A decoding method as defined in claim 84, wherein the transition mode codebook comprises a fixed codebook independent from past excitation.
86. A decoding method as defined in claim 84, wherein replacing the adaptive codebook excitation by the transition mode excitation in the transition frame and/or the frame following the transition reduces error propagation at the decoder device in case of frame erasure and/or improves coding efficiency.
87. A decoding method as defined in claim 84, comprising providing as the transition mode codebook a codebook of glottal impulse shapes.
88. A decoding method as defined in claim 84, wherein the sound signal comprises a speech signal and wherein said method comprises selecting the transition frame from the group consisting of a frame comprising a voiced onset and a frame comprising a transition between two different voiced sounds.
89. A decoding method as defined in claim 84, wherein the transition frame and/or the frame following the transition each comprise a plurality of subframes, and wherein said method comprises using the transition mode codebook in a first part of the subframes and a predictive-type codebook in a second part of the subframes.
90. A decoding method as defined in claim 84, comprising providing as the transition mode codebook a glottal-shape codebook comprising codevectors formed of a glottal impulse shape placed at a specific position in the codevector.
91. A decoding method as defined in claim 90, wherein the glottal-shape codebook includes a predetermined number of different shapes of glottal impulses, and wherein said method comprises forming a plurality of different codevectors of the glottal-shape codebook by positioning each shape of glottal impulse at a plurality of different positions in the codevector.
92. A decoding method as defined in claim 90, wherein codevectors of the set are generated by the glottal-shape codebook by generating codevectors containing only one non-zero element and processing through a shaping filter the codevectors containing only one non-zero element to produce codevectors representing glottal impulse shapes centered at different positions.
93. A decoding method as defined in claim 92, further comprising applying a gain to the codevectors representing glottal impulse shapes centered at different positions.
94. A decoding method as defined in claim 84, further comprising: providing an innovation codebook for generating a set of innovation codevectors; supplying an innovation codebook index to the innovation codebook; generating, by means of the innovation codebook and in response to the innovation codebook index, one of the innovation codevectors of the set corresponding to an innovation excitation; and adding the transition mode excitation and the innovation excitation to produce a global excitation for a sound signal synthesis filter.
PCT/CA2007/001896 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals WO2008049221A1 (en)

Priority Applications (12)

Application Number Priority Date Filing Date Title
EP07816046.2A EP2102619B1 (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals
JP2009533622A JP5166425B2 (en) 2006-10-24 2007-10-24 Method and device for encoding transition frames in speech signals
ES07816046.2T ES2624718T3 (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in voice signals
KR1020097010701A KR101406113B1 (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals
CN2007800480774A CN101578508B (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals
MX2009004427A MX2009004427A (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals.
CA2666546A CA2666546C (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals
DK07816046.2T DK2102619T3 (en) 2006-10-24 2007-10-24 METHOD AND DEVICE FOR CODING TRANSITION FRAMEWORK IN SPEECH SIGNALS
BRPI0718300-3A BRPI0718300B1 (en) 2006-10-24 2007-10-24 METHOD AND DEVICE FOR CODING TRANSITION TABLES IN SPEAKING SIGNS.
US12/446,892 US8401843B2 (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals
NO20092017A NO341585B1 (en) 2006-10-24 2009-05-25 Method and apparatus for encoding transition frames in speech signals
HK09112127.2A HK1132324A1 (en) 2006-10-24 2009-12-24 Method and device for coding transition frames in speech signals

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US85374906P 2006-10-24 2006-10-24
US60/853,749 2006-10-24

Publications (1)

Publication Number Publication Date
WO2008049221A1 true WO2008049221A1 (en) 2008-05-02

Family

ID=39324068

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2007/001896 WO2008049221A1 (en) 2006-10-24 2007-10-24 Method and device for coding transition frames in speech signals

Country Status (16)

Country Link
US (1) US8401843B2 (en)
EP (1) EP2102619B1 (en)
JP (1) JP5166425B2 (en)
KR (1) KR101406113B1 (en)
CN (1) CN101578508B (en)
BR (1) BRPI0718300B1 (en)
CA (1) CA2666546C (en)
DK (1) DK2102619T3 (en)
ES (1) ES2624718T3 (en)
HK (1) HK1132324A1 (en)
MX (1) MX2009004427A (en)
MY (1) MY152845A (en)
NO (1) NO341585B1 (en)
PT (1) PT2102619T (en)
RU (1) RU2462769C2 (en)
WO (1) WO2008049221A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010204391A (en) * 2009-03-03 2010-09-16 Nippon Telegr & Teleph Corp <Ntt> Voice signal modeling method, signal recognition device and method, parameter learning device and method, and feature value generating device, method, and program
JP2012507752A (en) * 2008-10-30 2012-03-29 クゥアルコム・インコーポレイテッド Coding scheme selection for low bit rate applications
JP2012507751A (en) * 2008-10-30 2012-03-29 クゥアルコム・インコーポレイテッド Coding transition speech frames for low bit rate applications
US9564143B2 (en) 2012-11-15 2017-02-07 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
CN106415715A (en) * 2014-05-01 2017-02-15 日本电信电话株式会社 Encoding device, decoding device, encoding and decoding methods, and encoding and decoding programs
US9852741B2 (en) 2014-04-17 2017-12-26 Voiceage Corporation Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
RU2666327C2 (en) * 2013-06-21 2018-09-06 Фраунхофер-Гезелльшафт Цур Фердерунг Дер Ангевандтен Форшунг Е.Ф. Apparatus and method for improved concealment of the adaptive codebook in acelp-like concealment employing improved pulse resynchronization
US10381011B2 (en) 2013-06-21 2019-08-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for improved concealment of the adaptive codebook in a CELP-like concealment employing improved pitch lag estimation
WO2020223797A1 (en) 2019-05-07 2020-11-12 Voiceage Corporation Methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2112653A4 (en) * 2007-05-24 2013-09-11 Panasonic Corp Audio decoding device, audio decoding method, program, and integrated circuit
US8515767B2 (en) * 2007-11-04 2013-08-20 Qualcomm Incorporated Technique for encoding/decoding of codebook indices for quantized MDCT spectrum in scalable speech and audio codecs
US20090319261A1 (en) * 2008-06-20 2009-12-24 Qualcomm Incorporated Coding of transitional speech frames for low-bit-rate applications
KR101137652B1 (en) * 2009-10-14 2012-04-23 광운대학교 산학협력단 Unified speech/audio encoding and decoding apparatus and method for adjusting overlap area of window based on transition
JP5314771B2 (en) * 2010-01-08 2013-10-16 日本電信電話株式会社 Encoding method, decoding method, encoding device, decoding device, program, and recording medium
US9626982B2 (en) * 2011-02-15 2017-04-18 Voiceage Corporation Device and method for quantizing the gains of the adaptive and fixed contributions of the excitation in a CELP codec
NO2669468T3 (en) * 2011-05-11 2018-06-02
US9972325B2 (en) 2012-02-17 2018-05-15 Huawei Technologies Co., Ltd. System and method for mixed codebook excitation for speech coding
FR3001593A1 (en) * 2013-01-31 2014-08-01 France Telecom IMPROVED FRAME LOSS CORRECTION AT SIGNAL DECODING.
SI3848929T1 (en) * 2013-03-04 2023-12-29 Voiceage Evs Llc Device and method for reducing quantization noise in a time-domain decoder
CN105247614B (en) * 2013-04-05 2019-04-05 杜比国际公司 Audio coder and decoder
CN104301064B (en) 2013-07-16 2018-05-04 华为技术有限公司 Handle the method and decoder of lost frames
US10614816B2 (en) * 2013-10-11 2020-04-07 Qualcomm Incorporated Systems and methods of communicating redundant frame information
CN104637486B (en) * 2013-11-07 2017-12-29 华为技术有限公司 The interpolating method and device of a kind of data frame
CN103680509B (en) * 2013-12-16 2016-04-06 重庆邮电大学 A kind of voice signal discontinuous transmission and ground unrest generation method
CN106683681B (en) * 2014-06-25 2020-09-25 华为技术有限公司 Method and device for processing lost frame
FR3024582A1 (en) * 2014-07-29 2016-02-05 Orange MANAGING FRAME LOSS IN A FD / LPD TRANSITION CONTEXT
FR3024581A1 (en) * 2014-07-29 2016-02-05 Orange DETERMINING A CODING BUDGET OF A TRANSITION FRAME LPD / FD
KR101987565B1 (en) 2014-08-28 2019-06-10 노키아 테크놀로지스 오와이 Audio parameter quantization
US9916835B2 (en) * 2015-01-22 2018-03-13 Sennheiser Electronic Gmbh & Co. Kg Digital wireless audio transmission system
US10157441B2 (en) * 2016-12-27 2018-12-18 Automotive Research & Testing Center Hierarchical system for detecting object with parallel architecture and hierarchical method thereof
BR112020004909A2 (en) * 2017-09-20 2020-09-15 Voiceage Corporation method and device to efficiently distribute a bit-budget on a celp codec
MX2021009635A (en) * 2019-02-21 2021-09-08 Ericsson Telefon Ab L M Spectral shape estimation from mdct coefficients.
CN111123305B (en) * 2019-12-12 2023-08-22 秦然 Graphical noise coefficient optimization method for GNSS recording playback tester

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5119424A (en) * 1987-12-14 1992-06-02 Hitachi, Ltd. Speech coding system using excitation pulse train
US5495555A (en) * 1992-06-01 1996-02-27 Hughes Aircraft Company High quality low bit rate celp-based speech codec
US20010053972A1 (en) * 1997-12-24 2001-12-20 Tadashi Amada Method and apparatus for an encoding and decoding a speech signal by adaptively changing pulse position candidates
US20040018162A1 (en) * 2001-03-24 2004-01-29 Rudolf Bimczok Use of agents containing creatine, creatinine and/or derivatives thereof for strengthening and improving the structure of keratin fibers

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US549555A (en) * 1895-11-12 white
CA2108623A1 (en) * 1992-11-02 1994-05-03 Yi-Sheng Wang Adaptive pitch pulse enhancer and method for use in a codebook excited linear prediction (celp) search loop
EP1355298B1 (en) 1993-06-10 2007-02-21 Oki Electric Industry Company, Limited Code Excitation linear prediction encoder and decoder
WO1999010719A1 (en) 1997-08-29 1999-03-04 The Regents Of The University Of California Method and apparatus for hybrid coding of speech at 4kbps
EP1746583B1 (en) 1997-10-22 2008-09-17 Matsushita Electric Industrial Co., Ltd. Sound encoder and sound decoder
CN1494055A (en) * 1997-12-24 2004-05-05 ������������ʽ���� Method and apapratus for sound encoding and decoding
US7072832B1 (en) * 1998-08-24 2006-07-04 Mindspeed Technologies, Inc. System for speech encoding having an adaptive encoding arrangement
US6192335B1 (en) * 1998-09-01 2001-02-20 Telefonaktieboiaget Lm Ericsson (Publ) Adaptive combining of multi-mode coding for voiced speech and noise-like signals
JP4008607B2 (en) * 1999-01-22 2007-11-14 株式会社東芝 Speech encoding / decoding method
US6782360B1 (en) 1999-09-22 2004-08-24 Mindspeed Technologies, Inc. Gain quantization for a CELP speech coder
ATE420432T1 (en) 2000-04-24 2009-01-15 Qualcomm Inc METHOD AND DEVICE FOR THE PREDICTIVE QUANTIZATION OF VOICEABLE SPEECH SIGNALS
DE10124420C1 (en) * 2001-05-18 2002-11-28 Siemens Ag Coding method for transmission of speech signals uses analysis-through-synthesis method with adaption of amplification factor for excitation signal generator
EP1505573B1 (en) * 2002-05-10 2008-09-03 Asahi Kasei Kabushiki Kaisha Speech recognition device
CA2388439A1 (en) * 2002-05-31 2003-11-30 Voiceage Corporation A method and device for efficient frame erasure concealment in linear predictive based speech codecs
JP4414705B2 (en) * 2003-09-17 2010-02-10 パナソニック株式会社 Excitation signal encoding apparatus and excitation signal encoding method
US7668712B2 (en) * 2004-03-31 2010-02-23 Microsoft Corporation Audio encoding and decoding with intra frames and adaptive forward error correction
GB0408856D0 (en) * 2004-04-21 2004-05-26 Nokia Corp Signal encoding
CN1989548B (en) * 2004-07-20 2010-12-08 松下电器产业株式会社 Audio decoding device and compensation frame generation method
US7752039B2 (en) 2004-11-03 2010-07-06 Nokia Corporation Method and device for low bit rate speech coding

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5119424A (en) * 1987-12-14 1992-06-02 Hitachi, Ltd. Speech coding system using excitation pulse train
US5495555A (en) * 1992-06-01 1996-02-27 Hughes Aircraft Company High quality low bit rate celp-based speech codec
US20010053972A1 (en) * 1997-12-24 2001-12-20 Tadashi Amada Method and apparatus for an encoding and decoding a speech signal by adaptively changing pulse position candidates
US20040018162A1 (en) * 2001-03-24 2004-01-29 Rudolf Bimczok Use of agents containing creatine, creatinine and/or derivatives thereof for strengthening and improving the structure of keratin fibers

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANDERSEN ET AL.: "ILBC - A Linear Predictive Coder with Robustness to Packet Loss", SPEECH CODING, 2002, IEEE WORKSHOP PROCEEDINGS, 6 October 2002 (2002-10-06) - 9 October 2002 (2002-10-09), pages 2325, XP010647200 *
ANDERSON ET AL.: "Pitch Resynchronization with Recovering from a Late Frame in a Predictive Speech Decoder", PROC. IEEE INT. CONF. ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, TOULOUSE, FRANCE, May 2006 (2006-05-01), pages 245 - 248, XP003009933 *
CHIBANI ET AL.: "Fast Recovery for a CELP-Like Speech Codec After a Frame Erasure", AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE TRANSACTIONS, vol. 15, no. 8, November 2007 (2007-11-01), pages 2485 - 2495, XP011192967 *
See also references of EP2102619A4 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8768690B2 (en) 2008-06-20 2014-07-01 Qualcomm Incorporated Coding scheme selection for low-bit-rate applications
JP2012507752A (en) * 2008-10-30 2012-03-29 クゥアルコム・インコーポレイテッド Coding scheme selection for low bit rate applications
JP2012507751A (en) * 2008-10-30 2012-03-29 クゥアルコム・インコーポレイテッド Coding transition speech frames for low bit rate applications
JP2010204391A (en) * 2009-03-03 2010-09-16 Nippon Telegr & Teleph Corp <Ntt> Voice signal modeling method, signal recognition device and method, parameter learning device and method, and feature value generating device, method, and program
US11195538B2 (en) 2012-11-15 2021-12-07 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
US11749292B2 (en) 2012-11-15 2023-09-05 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
US9564143B2 (en) 2012-11-15 2017-02-07 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
US9881627B2 (en) 2012-11-15 2018-01-30 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
US11211077B2 (en) 2012-11-15 2021-12-28 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
US11176955B2 (en) 2012-11-15 2021-11-16 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
US10553231B2 (en) 2012-11-15 2020-02-04 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
US20200126578A1 (en) 2012-11-15 2020-04-23 Ntt Docomo, Inc. Audio coding device, audio coding method, audio coding program, audio decoding device, audio decoding method, and audio decoding program
RU2666327C2 (en) * 2013-06-21 2018-09-06 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for improved concealment of the adaptive codebook in ACELP-like concealment employing improved pulse resynchronization
US11410663B2 (en) 2013-06-21 2022-08-09 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for improved concealment of the adaptive codebook in ACELP-like concealment employing improved pitch lag estimation
US10381011B2 (en) 2013-06-21 2019-08-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for improved concealment of the adaptive codebook in a CELP-like concealment employing improved pitch lag estimation
US10643624B2 (en) 2013-06-21 2020-05-05 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for improved concealment of the adaptive codebook in ACELP-like concealment employing improved pulse resynchronization
US9852741B2 (en) 2014-04-17 2017-12-26 VoiceAge Corporation Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
US10468045B2 (en) 2014-04-17 2019-11-05 VoiceAge EVS LLC Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
US10431233B2 (en) 2014-04-17 2019-10-01 VoiceAge EVS LLC Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
US11282530B2 (en) 2014-04-17 2022-03-22 VoiceAge EVS LLC Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
EP3511935A1 (en) 2014-04-17 2019-07-17 VoiceAge Corporation Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
US11721349B2 (en) 2014-04-17 2023-08-08 VoiceAge EVS LLC Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
EP4336500A2 (en) 2014-04-17 2024-03-13 VoiceAge EVS LLC Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
CN106415715B (en) * 2014-05-01 2019-11-01 Nippon Telegraph and Telephone Corporation Encoding device, encoding method, and recording medium
CN106415715A (en) * 2014-05-01 2017-02-15 Nippon Telegraph and Telephone Corporation Encoding device, decoding device, encoding and decoding methods, and encoding and decoding programs
WO2020223797A1 (en) 2019-05-07 2020-11-12 Voiceage Corporation Methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack

Also Published As

Publication number Publication date
HK1132324A1 (en) 2010-02-19
KR20090073253A (en) 2009-07-02
DK2102619T3 (en) 2017-05-15
JP5166425B2 (en) 2013-03-21
EP2102619A1 (en) 2009-09-23
PT2102619T (en) 2017-05-25
NO341585B1 (en) 2017-12-11
ES2624718T3 (en) 2017-07-17
NO20092017L (en) 2009-05-25
JP2010507818A (en) 2010-03-11
RU2462769C2 (en) 2012-09-27
CN101578508B (en) 2013-07-17
EP2102619B1 (en) 2017-03-22
BRPI0718300B1 (en) 2018-08-14
EP2102619A4 (en) 2012-03-28
CN101578508A (en) 2009-11-11
US20100241425A1 (en) 2010-09-23
US8401843B2 (en) 2013-03-19
KR101406113B1 (en) 2014-06-11
RU2009119491A (en) 2010-11-27
CA2666546A1 (en) 2008-05-02
CA2666546C (en) 2016-01-19
BRPI0718300A2 (en) 2014-01-07
MY152845A (en) 2014-11-28
MX2009004427A (en) 2009-06-30

Similar Documents

Publication Publication Date Title
EP2102619B1 (en) Method and device for coding transition frames in speech signals
US8566106B2 (en) Method and device for fast algebraic codebook search in speech and audio coding
EP1886306B1 (en) Redundant audio bit stream and audio bit stream processing methods
JP6316398B2 (en) Apparatus and method for quantizing adaptive and fixed contribution gains of excitation signals in a CELP codec
CN101180676B (en) Methods and apparatus for quantization of spectral envelope representation
EP1576585B1 (en) Method and device for robust predictive vector quantization of linear prediction parameters in variable bit rate speech coding
EP2313887B1 (en) Variable bit rate lpc filter quantizing and inverse quantizing device and method
EP1062661B1 (en) Speech coding
US20070244695A1 (en) Selection of encoding modes and/or encoding rates for speech compression with closed loop re-decision
US9972325B2 (en) System and method for mixed codebook excitation for speech coding
EP1224662A1 (en) Variable bit-rate CELP coding of speech with phonetic classification
CN107710324B (en) Audio encoder and method for encoding an audio signal
JPH09508479A (en) Burst excitation linear prediction
Eksler et al. Glottal-shape codebook to improve robustness of CELP codecs
Eksler et al. Transition mode coding for source controlled CELP codecs
Kroon et al. A low-complexity toll-quality variable bit rate coder for CDMA cellular systems
WO2020223797A1 (en) Methods and devices for detecting an attack in a sound signal to be coded and for coding the detected attack
Kim et al. A 4 kbps adaptive fixed code-excited linear prediction speech coder
JP2001100799A (en) Method and device for sound encoding and computer-readable recording medium storing a sound encoding algorithm
Stegmann et al. CELP coding based on signal classification using the dyadic wavelet transform
Eksler et al. A new fast algebraic fixed codebook search algorithm in CELP speech coding

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase
Ref document number: 200780048077.4
Country of ref document: CN

121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 07816046
Country of ref document: EP
Kind code of ref document: A1

ENP Entry into the national phase
Ref document number: 2666546
Country of ref document: CA

WWE Wipo information: entry into national phase
Ref document number: 12009500783
Country of ref document: PH

WWE Wipo information: entry into national phase
Ref document number: 2685/DELNP/2009
Country of ref document: IN

ENP Entry into the national phase
Ref document number: 2009533622
Country of ref document: JP
Kind code of ref document: A

WWE Wipo information: entry into national phase
Ref document number: 388669
Country of ref document: PL
Ref document number: MX/A/2009/004427
Country of ref document: MX

NENP Non-entry into the national phase
Ref country code: DE

REEP Request for entry into the european phase
Ref document number: 2007816046
Country of ref document: EP

WWE Wipo information: entry into national phase
Ref document number: 2007816046
Country of ref document: EP

ENP Entry into the national phase
Ref document number: 2009119491
Country of ref document: RU
Kind code of ref document: A

WWE Wipo information: entry into national phase
Ref document number: 1020097010701
Country of ref document: KR

WWE Wipo information: entry into national phase
Ref document number: 12446892
Country of ref document: US

ENP Entry into the national phase
Ref document number: PI0718300
Country of ref document: BR
Kind code of ref document: A2
Effective date: 20090424