EP1129451A1 - Closed-loop variable-rate multimode predictive speech coder - Google Patents

Closed-loop variable-rate multimode predictive speech coder

Info

Publication number
EP1129451A1
EP1129451A1
Authority
EP
European Patent Office
Prior art keywords
coding
coding mode
mode
speech
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99957560A
Other languages
German (de)
French (fr)
Inventor
Amitava Das
Sharath Manjunath
Andrew P. Dejaco
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Inc filed Critical Qualcomm Inc
Publication of EP1129451A1 publication Critical patent/EP1129451A1/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • G10L19/24Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding

Definitions

  • the present invention pertains generally to the field of speech processing, and more specifically to closed-loop, variable-rate, multimode, predictive coding of speech.
  • Speech coders typically comprise an encoder and a decoder, or a codec.
  • the encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet.
  • the data packets are transmitted over the communication channel to a receiver and a decoder.
  • the decoder processes the data packets, unquantizes them to produce the parameters, and then resynthesizes the speech frames using the unquantized parameters.
  • the function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech.
  • the challenge is to retain high voice quality of the decoded speech while achieving the target compression factor.
  • the performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
  • a multimode coder applies different modes, or encoding-decoding algorithms, to different types of input speech frames.
  • Each mode, or encoding-decoding process is customized to represent a certain type of speech segment (i.e., voiced, unvoiced, or background noise) in the most efficient manner.
  • An external mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame.
  • the mode decision is done in an open-loop fashion by extracting a number of parameters out of the input frame and evaluating them to make a decision as to which mode to apply.
  • the mode decision is made without knowing in advance the exact condition of the output speech, i.e., how similar the output speech will be to the input speech in terms of voice-quality or any other performance measure.
  • An exemplary open-loop mode decision for a speech codec is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
  • Multimode coding can be fixed-rate, using the same number of bits No for each frame, or variable-rate, in which different bit rates are used for different modes.
  • the goal in variable-rate coding is to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain the target quality.
  • VBR variable-bit-rate
  • Conventional VBR speech coders are designed with modes having different bit-rates.
  • An exemplary variable rate speech coder is described in U.S. Patent No. 5,414,796, assigned to the assignee of the present invention and previously fully incorporated herein by reference.
  • the codec described in the aforesaid patent has the following four rates: (1) full rate (FR); (2) half rate (HR); (3) quarter rate (QR); and (4) eighth rate (ER).
  • FR full rate
  • HR half rate
  • QR quarter rate
  • ER eighth rate
  • each frame of speech is encoded by 160, 80, 40, and 20 bits per frame, respectively.
  • An external open-loop mode decision is made regarding which mode (FR, HR, QR or ER) to apply to the input speech frame.
  • the application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems.
  • the driving forces are the need for high capacity and the demand for robust performance under packet loss situations.
  • Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms.
  • a low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low- rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit-budget of coder specifications and deliver a robust performance under channel error conditions.
  • Conventional speech coders typically use some form of prediction mechanism to encode the current frame.
  • a speech coder exploits and uses the information contained in the last decoded and recreated frame. This works well because there is typically strong correlation, or similarity, between successive frames.
  • P(n) is a conventional prediction filter that produces an approximation of the current frame from the past quantized frame, and E_cur_quantized(n) is the quantized version of the prediction error E_cur(n) of the current frame.
  • SNR signal-to- noise ratio
  • PSNR perceptual SNR
  • the prediction filter information is necessarily sent to the decoder as a certain number of bits, Np.
  • the remaining available bits, No - Np, can be used to encode the prediction error signal E_cur. If the prediction from the quantized past frame, S_prev_quantized, generates an excellent predicted representation S_cur_predicted of the current frame S_cur, the prediction error E_cur will be small, having a low dynamic range. Hence, it will be relatively easy to encode the prediction error E_cur with a small number of bits.
  • the total number of bits per frame, No is high.
  • the QCELP 13k vocoder supports 260 bits per 20-ms frame. Therefore, even after allocating a number of bits, Np, to quantize the prediction filter parameter, there are enough remaining bits, No - Np, to accurately encode the prediction error.
  • Np a number of bits
  • No-Np a number of bits
  • a speech coder advantageously includes a codec configured to operate in at least one of a plurality of coding modes; and a closed- loop mode decision module coupled to the codec and configured to apply a first coding mode from the plurality of coding modes to an input speech frame, the first coding mode having a first bit rate that is lower than the bit rate of any other coding mode of the plurality of coding modes, the closed-loop mode decision module being further configured to obtain a performance measure of the codec, compare the performance measure with a threshold value, and, if the performance measure does not exceed the threshold value, reject the first coding mode in favor of a second coding mode having a second bit rate that is greater than the first bit rate.
  • a method of coding speech frames advantageously includes the steps of selecting a first coding mode to apply to a speech frame, the first coding mode having a first bit rate; obtaining a coding performance measure; comparing the coding performance measure with a threshold value; and rejecting the first coding mode in favor of a second coding mode if the coding performance measure does not exceed the threshold value, the second coding mode having a second bit rate that exceeds the first bit rate.
  • a speech coder advantageously includes means for selecting a first coding mode to apply to a speech frame, the first coding mode having a first bit rate; means for obtaining a coding performance measure; means for comparing the coding performance measure with a threshold value; and means for rejecting the first coding mode in favor of a second coding mode if the coding performance measure does not exceed the threshold value, the second coding mode having a second bit rate that exceeds the first bit rate.
  • FIG. 1 is a block diagram of a communication channel terminated at each end by speech coders.
  • FIG. 2 is a block diagram of an encoder.
  • FIG. 3 is a block diagram of a decoder.
  • FIG. 4 is a flow chart illustrating the steps of a closed-loop, multimode, predictive coding technique for speech frames at low bit rates.
  • a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission on a transmission medium 12, or communication channel 12, to a first decoder 14.
  • the decoder 14 decodes the encoded speech samples and synthesizes an output speech signal s SYNTH (n).
  • a second encoder 16 encodes digitized speech samples s(n), which are transmitted on a communication channel 18.
  • a second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal s SYNTH (n).
  • the speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art including, e.g., pulse code modulation (PCM), companded μ-law, or A-law.
  • PCM pulse code modulation
  • the speech samples s(n) are organized into frames of input data wherein each frame comprises a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples.
  • the rate of data transmission may advantageously be varied on a frame-to- frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Varying the data transmission rate is advantageous because lower bit rates may be selectively employed for frames containing relatively less speech information. As understood by those skilled in the art, other sampling rates, frame sizes, and data transmission rates may be used.
  • the first encoder 10 and the second decoder 20 together comprise a first speech coder, or speech codec.
  • the second encoder 16 and the first decoder 14 together comprise a second speech coder.
  • speech coders may be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor.
  • the software module could reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art.
  • any conventional processor, controller, or state machine could be substituted for the microprocessor.
  • Exemplary ASICs designed specifically for speech coding are described in U.S. Patent No. 5,727,123, assigned to the assignee of the present invention and fully incorporated herein by reference, and U.S. Application Serial No. 08/197,417, entitled VOCODER ASIC, filed February 16, 1994, assigned to the assignee of the present invention, and fully incorporated herein by reference.
  • an encoder 100 that may be used in a speech coder includes a mode decision module 102, a pitch estimation module 104, an LP analysis module 106, an LP analysis filter 108, an LP quantization module 110, and a residue quantization module 112.
  • Input speech frames s(n) are provided to the mode decision module 102, the pitch estimation module 104, the LP analysis module 106, and the LP analysis filter 108.
  • the mode decision module 102 produces a mode index I M and a mode M based upon the periodicity of each input speech frame s(n).
  • Various methods of classifying speech frames according to periodicity are described in U.S. Application Serial No.
  • the pitch estimation module 104 produces a pitch index I P and a lag value
  • the LP analysis module 106 performs linear predictive analysis on each input speech frame s(n) to generate an LP parameter a.
  • the LP parameter a is provided to the LP quantization module 110.
  • the LP quantization module 110 also receives the mode M.
  • the LP quantization module 110 produces an LP index ILP and a quantized LP parameter â.
  • the LP analysis filter 108 receives the quantized LP parameter â in addition to the input speech frame s(n).
  • the LP analysis filter 108 generates an LP residue signal R[n], which represents the error between the input speech frames s(n) and the quantized linear predicted parameters â.
  • the LP residue R[n], the mode M, and the quantized LP parameter â are provided to the residue quantization module 112. Based upon these values, the residue quantization module 112 produces a residue index IR and a quantized residue signal R̂[n].
  • a decoder 200 that may be used in a speech coder includes an LP parameter decoding module 202, a residue decoding module 204, a mode decoding module 206, and an LP synthesis filter 208.
  • the mode decoding module 206 receives and decodes a mode index I M , generating therefrom a mode M.
  • the LP parameter decoding module 202 receives the mode M and an LP index I LP .
  • the LP parameter decoding module 202 decodes the received values to produce a quantized LP parameter a .
  • the residue decoding module 204 receives a residue index IR, a pitch index IP, and the mode index IM.
  • the residue decoding module 204 decodes the received values to generate a quantized residue signal R̂[n].
  • the quantized residue signal R̂[n] and the quantized LP parameter â are provided to the LP synthesis filter 208, which synthesizes a decoded output speech signal s[n] therefrom.
  • a multimode coder first uses an open-loop decision mode, relying on parameters extracted out of the current frame to classify the current frame as background-noise /silence (N), unvoiced speech (UV), or voiced speech (V).
  • N-type frames are coded with an eighth-rate mode
  • UV-type frames are coded with a quarter- rate mode.
  • V-type frames i.e., voiced speech frames
  • the full-rate mode may advantageously be a prediction-based coding scheme with adequate bits to accurately encode various types of voiced speech, delivering a perceptual signal- to-noise ratio (PSNR) well above the target PSNR (a predefined or variable threshold value).
  • PSNR perceptual signal- to-noise ratio
  • the half-rate mode is advantageously a prediction-based coding scheme designed to encode frames having a high degree of correlation with the previous frame (i.e., frames that are quite similar to the previous frame).
  • the number of bits available in the half-rate mode is adequate to encode the prediction parameters for frames with high correlation, as well as the prediction error, which is relatively small due to the high correlation between successive frames.
  • Such frames are typically encountered in steady voiced speech segments, which are therefore amenable to half-rate coding.
  • the performance of prediction-based coding schemes also depends on how accurately the previous frame is quantized.
  • a closed-loop mode selection process is employed after the open-loop mode to ensure that the coding performance exceeds the predefined (or variable) target PSNR value.
  • the open-loop mode need not necessarily be applied at all.
  • the flow chart of FIG. 4 illustrates a closed-loop, multimode, predictive coding technique for speech frames at low bit rates, in accordance with one embodiment.
  • a frame number counter is set equal to 1.
  • the algorithm then proceeds to step 302, starting the coding process.
  • the algorithm then proceeds to step 304.
  • the algorithm checks the current frame and the previous quantized frame.
  • the algorithm then proceeds to step 306.
  • the algorithm determines whether the current frame should be classified as silence or background noise. This determination is made in accordance with various conventional techniques for measuring frame energy, such as, e.g., calculating the sum-of-squares. If the frame is classified as silence or background noise, the algorithm proceeds to step 308.
  • the algorithm applies an eighth-rate coding mode to the frame.
  • step 312 the algorithm determines whether the current frame should be classified as unvoiced speech. This determination is made in accordance with various known methods of periodicity determination, such as, e.g., the use of zero crossings and normalized autocorrelation functions (NACFs). These techniques are described in the aforementioned U.S. Application Serial No. 08/815,354, previously fully incorporated herein by reference. If the frame is classified as unvoiced speech, the algorithm proceeds to step 314. In step 314 a quarter-rate coding mode is applied to the frame. The algorithm then proceeds to step 310.
  • NACFs normalized autocorrelation functions
  • step 312 the algorithm proceeds to step 316, considering the frame to contain voiced speech.
  • step 316 the algorithm goes to a half-rate prediction-based coding mode.
  • step 318 the PSNR is computed.
  • the algorithm then proceeds to step 320.
  • step 320 the algorithm determines whether the computed PSNR is greater than a predefined threshold, or target, PSNR value.
  • the threshold, or target, PSNR value may be a function of average bit rate. For example, the average bit rate is calculated periodically and fed back to the algorithm, which adjusts the target threshold value accordingly. Further, it should be understood that any conventional measure of performance may be substituted for PSNR.
  • the algorithm proceeds to step 322. In step 322 a half-rate coding mode is applied to the frame. The algorithm then proceeds to step 310. If, on the other hand, in step 320 the computed PSNR does not exceed the target PSNR, the algorithm proceeds to step 324. In step 324 the algorithm applies a full-rate coding mode to the frame. The algorithm then proceeds to step 310.
  • step 310 the frame number counter is incremented by 1.
  • the algorithm then proceeds to step 326.
  • step 326 the algorithm determines whether the frame number counter value is greater than or equal to the total number of frames that must be processed (i.e., whether there are any remaining frames to process). If the frame number counter value is less than the total number of frames to be processed, the algorithm returns to step 302, beginning the coding process for the next frame. If, on the other hand, the frame number counter value is greater than or equal to the total number of frames to be processed, the algorithm proceeds to step 328, ending the coding process.
  • the full-rate coding mode described above with respect to FIG. 4 could be a higher-bit-rate predictive mechanism (i.e., any bit rate that is greater than half-rate).
  • a higher-bit-rate, direct coding mechanism is substituted for the full-rate, predictive coding mode.
  • the direct coding mode encodes the current speech frame or residue without using any information from the previous frame.
  • a direct encoding method is appropriate for speech segments for which there is no similarity between the current frame and the previous frame.
  • An example is during the onset of a voice segment.
  • Another example is unvoiced-to-voiced segment transitions.
  • a direct encoding method is also useful in the middle of voiced segments when the cumulative effect of prediction-based encoding has degraded the past quantized frame so as to be too far out of sync with the corresponding original speech frame. In this case predictive coding will fail, even at much higher bit rates, due to the lack of similarity between the past quantized frame and the past original frame.
  • a fresh capture of the current frame with a direct encoding method will not only enhance the preservation of the current frame, but will also facilitate future prediction-based encoding of the next and later frames because the prediction mechanism will be aided by a more accurate memory.
  • the R1 coding method is a higher-rate, direct coding method.
  • the R2 coding method is a lower-rate, predictive coding method.
  • a closed-loop decision is performed such that the R2 coding method is tried first, the performance is checked by comparing with a performance measure, and the algorithm switches to the R1 coding method if the performance for the R2 coding mode is insufficient.
  • the higher-rate, R1 coding mode is tried first, the performance is checked by comparing with a performance measure, and, if the performance is satisfactory, the lower-rate, R2 coding mode is tried.
  • the performance check is then performed for the R2 coding mode, and if the R2 coding mode performance is unsatisfactory, the R1 coding mode is applied to the frame.
  • multiple coding modes having bit rates R1,R2,...,RN-1,RN (where R1>R2>...>RN-1>RN) are employed.
  • a closed-loop decision is performed such that the lowest rate, RN, is tried first. If the RN coding mode performs adequately, the RN coding mode is retained for the frame. Otherwise, the next, higher-rate coding mode, RN-1, is applied. The process is reiterated until either a coding mode performs adequately or the highest-rate mode, R1, is retained. In an alternate embodiment, the highest rate, R1, is tried first. If the R1 mode performs adequately, the next, lower-rate coding mode, R2, is tried. The process is continued until a given coding mode does not perform adequately (at which time the last coding mode to perform adequately is applied), or until the lowest-rate coding mode, RN, performs satisfactorily and is applied.
  • multiple coding modes having bit rates R1,R2,...,Rm-1,Rm,Rm+1,...,RN are employed.
  • the bit rates have the following relative magnitudes: R1>R2>...>Rm-1>Rm>Rm+1>...>RN.
  • a closed-loop mode decision works in conjunction with an open-loop mode decision.
  • the open-loop mode decision, based upon parameters such as frame energy or frame periodicity, tells the coder to apply a mode with a bit rate of Rm, at which point the closed-loop mode decision takes over.
  • the closed-loop mode decision applies the Rm coding mode, tests performance, and maintains the Rm coding mode if performance is satisfactory.
  • the closed-loop mode decision tries the next, higher-rate coding mode, Rm-1. The process is reiterated until either a coding mode performs adequately or the highest-rate mode, R1, is retained. Alternatively, the closed-loop mode decision applies the Rm coding mode, tests performance, and maintains the Rm coding mode if performance is satisfactory. Otherwise, the closed-loop mode decision tries the next, lower-rate coding mode, Rm+1. The process is reiterated until either a coding mode performs inadequately (at which time the last coding mode to perform adequately is applied), or the lowest-rate mode, RN, is retained.
  • multiple coding modes having bit rates R1,R2,...,RN (where R1>R2>...>RN) are employed. All of the coding modes are applied in parallel to the input speech frame, and the performances of the coding modes are compared with a set of N threshold performance measures. The coding mode that appears to produce the most accurate result is selected.
  • multiple coding modes having bit rates R1,R2,...,RN are employed. All of the coding modes are applied in parallel to the input speech frame, and the performances of the coding modes are compared with a set of N threshold performance measures. If several coding modes exceed the performance threshold target, the coding mode having the lowest bit rate (and also performing above the performance threshold) is selected.
  • multiple coding modes having bit rates R1,R2,...,Quarter Rate,...,Half Rate,...,RN (where R1 is Full Rate and RN is Eighth Rate) are employed.
  • a closed-loop mode decision works in conjunction with an open-loop mode decision.
  • the open-loop mode decision, based upon parameters such as frame energy or frame periodicity, tells the coder to apply the full-rate coding mode to unvoiced-to-voiced transition frames, voiced-to-voiced transition frames, nonstationary voiced segments, and nonstationary unvoiced segments. Also based upon frame parameters, the open-loop mode decision tells the coder to apply the half-rate coding mode to steady-voiced segments that exhibit a significant degree of similarity from frame to frame.
  • the open-loop mode decision tells the coder to apply the quarter-rate coding mode to steady unvoiced segments. Also based upon frame parameters, the open-loop mode decision tells the coder to apply the eighth-rate coding mode to background noise and other nonspeech signals such as silence.
  • the closed-loop mode decision takes over. The closed-loop mode decision applies the coding mode selected by the open-loop mode decision, tests performance, and maintains the selected coding mode if performance is satisfactory. Otherwise, the closed-loop mode decision tries the next, higher-rate coding mode. The process is reiterated until either a coding mode performs adequately or the full-rate mode is retained.
  • the closed-loop mode decision applies the coding mode selected by the open-loop mode decision, tests performance, and maintains the selected coding mode if performance is satisfactory. Otherwise, the closed-loop mode decision tries the next, lower-rate coding mode. The process is reiterated until either a coding mode performs inadequately (at which time the last coding mode to perform adequately is applied), or the lowest-rate mode is retained.
  • the MCCi and Mi coding modes each use the same source-coding mode (i.e., the same encoder and decoder).
  • the MCCi coding mode includes an additional layer of channel protection, in which (RCCi-Ri) bits are used for robust protection of the parameters of the Mi coding mode under the worst possible channel condition of the communication system.
  • the performance, or voice quality, delivered by the Mi coding mode under channel-error-free conditions is similar to the performance, or voice quality, delivered by the MCCi coding mode under the worst possible channel error condition.
  • the (RCCi-Ri) channel coding bits serve to provide adequate protection under the assumed, or target, worst channel condition.
  • the assumed worst channel condition may advantageously be, e.g., a predefined percentage of frame error rate (FER).
  • FER frame error rate
  • a closed-loop mode decision advantageously accounts for both channel variation and source variation to deliver a guaranteed quality of service. For example, a source-controlled, closed-loop mode decision such as described above is applied first. The closed-loop mode decision tells the coder to use the Mi coding mode. An illustrative sketch of this combined source-and-channel-controlled decision follows this list.
  • MCCi,j-RCCi represents the minimum number of bits needed to add channel error protection to the channel coding layer so that the channel error protection will be adequate for the worst-case scenario in the j-th channel error condition.
  • Such a closed-loop, combined-network-and-source-controlled codec delivers guaranteed quality of service across various channel conditions while also delivering a low average bit rate.
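The following minimal Python sketch illustrates the combined source-and-channel-controlled decision outlined in the items above, under the simplifying assumption that each source-coding mode Mi has channel-protected variants MCCi,j provisioned for the j-th channel error condition; the names, the channel-condition index, and the lookup structure are illustrative and not part of the disclosure.

    def combined_mode_decision(frame, source_modes, channel_condition, protected_variants,
                               closed_loop_pick):
        """Source-controlled closed-loop decision first (as described above), then a
        channel-controlled step: for the current channel error condition j, the variant
        of the selected mode Mi carrying additional channel-coding bits sized for that
        condition (MCCi,j) is used instead of Mi alone.
        protected_variants[(mode_name, j)] -> the MCCi,j mode description (assumed)."""
        mi = closed_loop_pick(frame, source_modes)            # source-controlled pick of Mi
        if channel_condition == 0:                            # assumed error-free channel
            return mi                                         # Mi alone meets the quality target
        return protected_variants[(mi["name"], channel_condition)]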

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A closed-loop, multimode, predictive speech coder includes a codec (100, 200) configured to operate in any of several coding modes, and a closed-loop mode decision module configured to apply a lowest-bit-rate coding mode to an input speech frame. A performance measure of the codec is obtained and compared with a threshold value. If the performance measure does not exceed the threshold value, the lowest-bit-rate coding mode is rejected in favor of a coding mode with a higher bit rate. The process can be continued until the coding performance is satisfactory. A higher-bit-rate, direct coding mode may be applied after a lower-bit-rate, prediction-based coding mode has failed to perform satisfactorily.

Description

CLOSED-LOOP VARIABLE-RATE MULTIMODE PREDICTIVE
SPEECH CODER
BACKGROUND OF THE INVENTION
I. Field of the Invention
The present invention pertains generally to the field of speech processing, and more specifically to closed-loop, variable-rate, multimode, predictive coding of speech.
II. Background of Invention
Transmission of voice by digital techniques has become widespread, particularly in long distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve a speech quality of conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, a significant reduction in the data rate can be achieved. Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder, or a codec. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, unquantizes them to produce the parameters, and then resynthesizes the speech frames using the unquantized parameters. The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni, and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr = Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
One effective technique to encode speech efficiently at low bit rate is multimode coding. A multimode coder applies different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment (i.e., voiced, unvoiced, or background noise) in the most efficient manner. An external mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. Typically, the mode decision is done in an open-loop fashion by extracting a number of parameters out of the input frame and evaluating them to make a decision as to which mode to apply. Thus, the mode decision is made without knowing in advance the exact condition of the output speech, i.e., how similar the output speech will be to the input speech in terms of voice quality or any other performance measure. An exemplary open-loop mode decision for a speech codec is described in U.S. Patent No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
Multimode coding can be fixed-rate, using the same number of bits No for each frame, or variable-rate, in which different bit rates are used for different modes. The goal in variable-rate coding is to use only the amount of bits needed to encode the codec parameters to a level adequate to obtain the target quality. As a result, the same target voice quality as that of a fixed-rate, higher-rate coder can be obtained at a significantly lower average rate using variable-bit-rate (VBR) techniques. Conventional VBR speech coders are designed with modes having different bit-rates. An exemplary variable rate speech coder is described in U.S. Patent No. 5,414,796, assigned to the assignee of the present invention and previously fully incorporated herein by reference. The codec described in the aforesaid patent has the following four rates: (1) full rate (FR); (2) half rate (HR); (3) quarter rate (QR); and (4) eighth rate (ER). For the foregoing rates, each frame of speech is encoded by 160, 80, 40, and 20 bits per frame, respectively. An external open-loop mode decision is made regarding which mode (FR, HR, QR or ER) to apply to the input speech frame.
There is presently a surge of research interest and strong commercial needs to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit-budget of coder specifications and deliver a robust performance under channel error conditions. Conventional speech coders typically use some form of prediction mechanism to encode the current frame. Thus, to encode the current frame, a speech coder exploits and uses the information contained in the last decoded and recreated frame. This works well because there is typically strong correlation, or similarity, between successive frames. Thus, a frame or short segment of speech, S_cur(n), where n = 1, 2, ..., N, having N samples can be encoded by a predictive method to form the encoded frame S_cur_quantized(n) according to the following equation:
where "*" represents a convolution operation, P(n) is a conventional prediction filter that produces an approximation of current frame from past quantized frame, tJTie quantized version of the prediction error Ecur(n) of the current frame. The prediction error is defined as Ecur(n) = Scur(n) - Scur prtH.1lctl, (π). The performance of the prediction scheme is often measured by a signal-to- noise ratio (SNR) or a perceptual SNR (PSNR), typically defined as:
where W(n), for n=l,2,...,N, is a perceptual weight factor and Ntur(n) is the error of the overall coding process. The error of the overall coding process is defined as Ncur(n) = Scur(n) - Scur quantl,ed(n)- F°r ordinary SNR, W(n) is set equal to 1 for all n=l,2,...,N. If the error N ur decreases, the performance of the prediction-based speech coding scheme, or the SNR, will increase. It is therefore advantageous to minimze the error Ncur. The equation
" S'cur.predicte r'JJ + Ecur(n)]
= Prediction-Error + Error in-the-Quantization-of-Prediction-Error-Signal
indicates that the overall error N_cur depends on how well the prediction is performed, and how well the prediction error is quantized. The prediction filter information is necessarily sent to the decoder as a certain number of bits, Np. The remaining available bits, No - Np, can be used to encode the prediction error signal E_cur. If the prediction from the quantized past frame, S_prev_quantized, generates an excellent predicted representation S_cur_predicted of the current frame S_cur, the prediction error E_cur will be small, having a low dynamic range. Hence, it will be relatively easy to encode the prediction error E_cur with a small number of bits.
For high-bit-rate predictive speech coders such as, e.g., the QCELP 13k vocoder manufactured by QUALCOMM INCORPORATED, the total number of bits per frame, No, is high. The QCELP 13k vocoder, for example, supports 260 bits per 20-ms frame. Therefore, even after allocating a number of bits, Np, to quantize the prediction filter parameter, there are enough remaining bits, No - Np, to accurately encode the prediction error. However, at low bit rates (e.g., 4 kbps and below), the total amount of bits available (i.e., eighty or less per frame) is not large enough to accurately encode both the prediction filter parameters and the prediction error signal. Consequently, the overall coding error N_cur grows large, resulting in poor performance and producing a quantized version S_cur_quantized of the current frame that could be quite different from the original frame S_cur. As the encoding of the next frame depends upon how well the current frame is encoded, the poor performance can degrade the performance of prediction of future frames as well. Thus, there is a need for a variable-rate, multimode, predictive coder that is capable of producing high-voice-quality at low bit rates.
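To make the bit-budget discussion concrete, the following illustrative Python sketch (not part of the disclosed embodiments) forms a predicted frame from the previous quantized frame, quantizes the prediction error with a crude uniform quantizer standing in for a residue codebook, and evaluates the PSNR defined above. The one-tap prediction filter, the quantizer step, and the toy signals are assumptions chosen only to exercise the formulas.

    import numpy as np

    def predict(prev_quantized, p):
        """Approximate the current frame by filtering the past quantized frame
        with a short prediction filter P(n) (convolution, truncated to frame length)."""
        return np.convolve(prev_quantized, p)[: len(prev_quantized)]

    def uniform_quantize(x, step):
        """Crude stand-in for the residue quantizer: round onto a uniform grid."""
        return np.round(x / step) * step

    def psnr_db(s_cur, n_cur, w=None):
        """PSNR = 10 log10( sum [W(n) S_cur(n)]^2 / sum [W(n) N_cur(n)]^2 ); W = 1 gives plain SNR."""
        w = np.ones_like(s_cur) if w is None else w
        return 10.0 * np.log10(np.sum((w * s_cur) ** 2) / np.sum((w * n_cur) ** 2))

    # Toy frames: 160 samples at 8 kHz, strongly correlated with the previous frame.
    rng = np.random.default_rng(0)
    t = np.arange(160)
    s_prev_quantized = np.sin(2 * np.pi * t / 40)                 # past decoded frame
    s_cur = np.sin(2 * np.pi * (t + 2) / 40) + 0.05 * rng.standard_normal(160)

    p = np.array([0.9])                                           # assumed one-tap prediction filter
    s_cur_predicted = predict(s_prev_quantized, p)                # S_cur_predicted(n)
    e_cur = s_cur - s_cur_predicted                               # prediction error E_cur(n)
    e_cur_quantized = uniform_quantize(e_cur, step=0.05)          # coarse residue quantization
    s_cur_quantized = s_cur_predicted + e_cur_quantized           # reconstructed frame
    n_cur = s_cur - s_cur_quantized                               # overall coding error N_cur(n)

    print(f"frame PSNR = {psnr_db(s_cur, n_cur):.1f} dB")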
SUMMARY OF THE INVENTION
The present invention is directed to a variable-rate, multimode, predictive coder that is capable of producing high-voice-quality at low bit rates. Accordingly, in one aspect of the invention, a speech coder advantageously includes a codec configured to operate in at least one of a plurality of coding modes; and a closed- loop mode decision module coupled to the codec and configured to apply a first coding mode from the plurality of coding modes to an input speech frame, the first coding mode having a first bit rate that is lower than the bit rate of any other coding mode of the plurality of coding modes, the closed-loop mode decision module being further configured to obtain a performance measure of the codec, compare the performance measure with a threshold value, and, if the performance measure does not exceed the threshold value, reject the first coding mode in favor of a second coding mode having a second bit rate that is greater than the first bit rate. In another aspect of the invention, a method of coding speech frames advantageously includes the steps of selecting a first coding mode to apply to a speech frame, the first coding mode having a first bit rate; obtaining a coding performance measure; comparing the coding performance measure with a threshold value; and rejecting the first coding mode in favor of a second coding mode if the coding performance measure does not exceed the threshold value, the second coding mode having a second bit rate that exceeds the first bit rate.
In another aspect of the invention, a speech coder advantageously includes means for selecting a first coding mode to apply to a speech frame, the first coding mode having a first bit rate; means for obtaining a coding performance measure; means for comparing the coding performance measure with a threshold value; and means for rejecting the first coding mode in favor of a second coding mode if the coding performance measure does not exceed the threshold value, the second coding mode having a second bit rate that exceeds the first bit rate.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a communication channel terminated at each end by speech coders.
FIG. 2 is a block diagram of an encoder.
FIG. 3 is a block diagram of a decoder. FIG. 4 is a flow chart illustrating the steps of a closed-loop, multimode, predictive coding technique for speech frames at low bit rates.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In FIG. 1 a first encoder 10 receives digitized speech samples s(n) and encodes the samples s(n) for transmission on a transmission medium 12, or communication channel 12, to a first decoder 14. The decoder 14 decodes the encoded speech samples and synthesizes an output speech signal sSYNTH(n). For transmission in the opposite direction, a second encoder 16 encodes digitized speech samples s(n), which are transmitted on a communication channel 18. A second decoder 20 receives and decodes the encoded speech samples, generating a synthesized output speech signal sSYNTH(n).
The speech samples s(n) represent speech signals that have been digitized and quantized in accordance with any of various methods known in the art including, e.g., pulse code modulation (PCM), companded μ-law, or A-law. As known in the art, the speech samples s(n) are organized into frames of input data wherein each frame comprises a predetermined number of digitized speech samples s(n). In an exemplary embodiment, a sampling rate of 8 kHz is employed, with each 20 ms frame comprising 160 samples. In the embodiments described below, the rate of data transmission may advantageously be varied on a frame-to- frame basis from 8 kbps (full rate) to 4 kbps (half rate) to 2 kbps (quarter rate) to 1 kbps (eighth rate). Varying the data transmission rate is advantageous because lower bit rates may be selectively employed for frames containing relatively less speech information. As understood by those skilled in the art, other sampling rates, frame sizes, and data transmission rates may be used. The first encoder 10 and the second decoder 20 together comprise a first speech coder, or speech codec. Similarly, the second encoder 16 and the first decoder 14 together comprise a second speech coder. It is understood by those of skill in the art that speech coders may be implemented with a digital signal processor (DSP), an application-specific integrated circuit (ASIC), discrete gate logic, firmware, or any conventional programmable software module and a microprocessor. The software module could reside in RAM memory, flash memory, registers, or any other form of writable storage medium known in the art. Alternatively, any conventional processor, controller, or state machine could be substituted for the microprocessor. Exemplary ASICs designed specifically for speech coding are described in U.S. Patent No. 5,727,123, assigned to the assignee of the present invention and fully incorporated herein by reference, and U.S. Application Serial No. 08/197,417, entitled VOCODER ASIC, filed February 16, 1994, assigned to the assignee of the present invention, and fully incorporated herein by reference.
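As a simple illustration of the frame and rate bookkeeping described above (the numerical values are those of the exemplary embodiment; the helper names are illustrative), the bits available per 20 ms frame at each transmission rate follow directly from the rate-times-duration product:

    SAMPLE_RATE_HZ = 8000
    FRAME_MS = 20
    SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000      # 160 samples per frame

    # Exemplary transmission rates from the text, in bits per second.
    RATES_BPS = {"full": 8000, "half": 4000, "quarter": 2000, "eighth": 1000}

    def bits_per_frame(rate_bps, frame_ms=FRAME_MS):
        """Bits available to encode one frame at the given transmission rate."""
        return rate_bps * frame_ms // 1000

    for name, bps in RATES_BPS.items():
        print(f"{name:>7} rate: {bits_per_frame(bps)} bits per {FRAME_MS} ms frame")
    # -> 160, 80, 40, and 20 bits, matching the four-rate codec described above.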
In FIG. 2 an encoder 100 that may be used in a speech coder includes a mode decision module 102, a pitch estimation module 104, an LP analysis module 106, an LP analysis filter 108, an LP quantization module 110, and a residue quantization module 112. Input speech frames s(n) are provided to the mode decision module 102, the pitch estimation module 104, the LP analysis module 106, and the LP analysis filter 108. The mode decision module 102 produces a mode index IM and a mode M based upon the periodicity of each input speech frame s(n). Various methods of classifying speech frames according to periodicity are described in U.S. Application Serial No. 08/815,354, entitled METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING, filed March 11, 1997, assigned to the assignee of the present invention, and fully incorporated herein by reference. Such methods are also incorporated into the Telecommunication Industry Association Industry Interim Standards TIA/EIA IS-127 and TIA/EIA IS-733. The pitch estimation module 104 produces a pitch index IP and a lag value
P0 based upon each input speech frame s(n). The LP analysis module 106 performs linear predictive analysis on each input speech frame s(n) to generate an LP parameter a. The LP parameter a is provided to the LP quantization module 110. The LP quantization module 110 also receives the mode M. The LP quantization module 110 produces an LP index ILP and a quantized LP parameter â. The LP analysis filter 108 receives the quantized LP parameter â in addition to the input speech frame s(n). The LP analysis filter 108 generates an LP residue signal R[n], which represents the error between the input speech frames s(n) and the quantized linear predicted parameters â. The LP residue R[n], the mode M, and the quantized LP parameter â are provided to the residue quantization module 112. Based upon these values, the residue quantization module 112 produces a residue index IR and a quantized residue signal R̂[n].
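The encoder modules of FIG. 2 are conventional. As one illustration of the LP analysis step only (not the specific implementation contemplated here), the sketch below derives LP coefficients from a frame by autocorrelation and the Levinson-Durbin recursion and then forms the LP residue R[n] with the corresponding analysis filter; the filter order, the toy input frame, and the absence of windowing and quantization are simplifying assumptions.

    import numpy as np

    def lp_coefficients(frame, order=10):
        """Levinson-Durbin recursion on the frame autocorrelation.
        Returns coefficients a[0..order] with a[0] = 1, so the analysis filter
        output is R[n] = sum_k a[k] * s[n - k]."""
        n = len(frame)
        r = [float(np.dot(frame[: n - k], frame[k:])) for k in range(order + 1)]
        a = [1.0]
        err = r[0] + 1e-12                       # small floor to avoid division by zero
        for i in range(1, order + 1):
            acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
            k = -acc / err
            a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
            err *= (1.0 - k * k)
        return np.array(a)

    def lp_residue(frame, a):
        """Run the LP analysis filter over the frame (zero initial filter state)."""
        return np.array([sum(a[k] * (frame[n - k] if n - k >= 0 else 0.0)
                             for k in range(len(a))) for n in range(len(frame))])

    # Toy voiced-like frame: 160 samples with pitch pulses every 5 ms at 8 kHz.
    rng = np.random.default_rng(1)
    s = np.zeros(160)
    s[::40] = 1.0
    s = np.convolve(s, np.exp(-np.arange(30) / 8.0))[:160] + 0.01 * rng.standard_normal(160)

    a = lp_coefficients(s, order=10)             # LP parameter "a" of FIG. 2
    residue = lp_residue(s, a)                   # LP residue R[n] fed to the residue quantizer
    print("prediction gain:", 10 * np.log10(np.sum(s ** 2) / np.sum(residue ** 2)), "dB")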
In FIG. 3 a decoder 200 that may be used in a speech coder includes an LP parameter decoding module 202, a residue decoding module 204, a mode decoding module 206, and an LP synthesis filter 208. The mode decoding module 206 receives and decodes a mode index IM, generating therefrom a mode M. The LP parameter decoding module 202 receives the mode M and an LP index ILP. The LP parameter decoding module 202 decodes the received values to produce a quantized LP parameter â. The residue decoding module 204 receives a residue index IR, a pitch index IP, and the mode index IM. The residue decoding module 204 decodes the received values to generate a quantized residue signal R̂[n]. The quantized residue signal R̂[n] and the quantized LP parameter â are provided to the LP synthesis filter 208, which synthesizes a decoded output speech signal s[n] therefrom.
Operation and implementation of the various modules of the encoder 100 of FIG. 2 and the decoder of FIG. 3 are known in the art, and are described in detail in L.B. Rabiner & R.W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference. An exemplary encoder and an exemplary decoder are described in U.S. Patent No. 5,414,796, previously fully incorporated herein by reference. In one embodiment a multimode coder first uses an open-loop decision mode, relying on parameters extracted out of the current frame to classify the current frame as background noise/silence (N), unvoiced speech (UV), or voiced speech (V). Various speech classification methods used for rate determination are known in the art, including methods described in the aforementioned U.S. Patent No. 5,414,796, previously fully incorporated herein by reference. N-type frames are coded with an eighth-rate mode, and UV-type frames are coded with a quarter-rate mode.
For V-type frames (i.e., voiced speech frames), either a higher-rate (No = N1 bits per frame) mode such as full rate, or a lower-rate (No = N2 bits per frame, where N2 < N1) mode such as half rate, is used. The full-rate mode may advantageously be a prediction-based coding scheme with adequate bits to accurately encode various types of voiced speech, delivering a perceptual signal-to-noise ratio (PSNR) well above the target PSNR (a predefined or variable threshold value). The half-rate mode is advantageously a prediction-based coding scheme designed to encode frames having a high degree of correlation with the previous frame (i.e., frames that are quite similar to the previous frame). Thus, the number of bits available in the half-rate mode, N2 bits per frame, is adequate to encode the prediction parameters for frames with high correlation, as well as the prediction error, which is relatively small due to the high correlation between successive frames. Such frames are typically encountered in steady voiced speech segments, which are therefore amenable to half-rate coding. Additionally, the performance of prediction-based coding schemes also depends on how accurately the previous frame is quantized. Hence, a closed-loop mode selection process is employed after the open-loop mode to ensure that the coding performance exceeds the predefined (or variable) target PSNR value. As those of skill in the art would understand, the open-loop mode need not necessarily be applied at all.
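A minimal sketch of such an open-loop N/UV/V classification, assuming simple frame-energy, zero-crossing, and normalized-autocorrelation (NACF) measures with illustrative thresholds (the actual features and thresholds are implementation choices not specified here), might look as follows:

    import numpy as np

    def classify_frame(frame, energy_thresh=1e-3, nacf_thresh=0.5, zcr_thresh=0.35):
        """Open-loop classification into background noise/silence (N), unvoiced
        speech (UV), or voiced speech (V) from frame energy, zero-crossing rate,
        and a normalized autocorrelation taken over a plausible pitch-lag range."""
        energy = float(np.mean(frame ** 2))
        if energy < energy_thresh:
            return "N"                           # silence / background noise -> eighth rate
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))))) / 2.0
        nacf = max(
            float(frame[:-lag] @ frame[lag:]) /
            (np.linalg.norm(frame[:-lag]) * np.linalg.norm(frame[lag:]) + 1e-12)
            for lag in range(20, 121)            # roughly 66-400 Hz pitch at 8 kHz
        )
        if nacf >= nacf_thresh and zcr < zcr_thresh:
            return "V"                           # voiced -> half rate, subject to closed-loop check
        return "UV"                              # unvoiced -> quarter rate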
The flow chart of FIG. 4 illustrates a closed-loop, multimode, predictive coding technique for speech frames at low bit rates, in accordance with one embodiment. In step 300 a frame number counter is set equal to 1. The algorithm then proceeds to step 302, starting the coding process. The algorithm then proceeds to step 304. In step 304 the algorithm checks the current frame and the previous quantized frame. The algorithm then proceeds to step 306. In step 306 the algorithm determines whether the current frame should be classified as silence or background noise. This determination is made in accordance with various conventional techniques for measuring frame energy, such as, e.g., calculating the sum-of-squares. If the frame is classified as silence or background noise, the algorithm proceeds to step 308. In step 308 the algorithm applies an eighth-rate coding mode to the frame. The algorithm then proceeds to step 310. If, on the other hand, in step 306 the frame is not classified as background noise or silence, the algorithm proceeds to step 312. In step 312 the algorithm determines whether the current frame should be classified as unvoiced speech. This determination is made in accordance with various known methods of periodicity determination, such as, e.g., the use of zero crossings and normalized autocorrelation functions (NACFs). These techniques are described in the aforementioned U.S. Application Serial No. 08/815,354, previously fully incorporated herein by reference. If the frame is classified as unvoiced speech, the algorithm proceeds to step 314. In step 314 a quarter-rate coding mode is applied to the frame. The algorithm then proceeds to step 310. If, on the other hand, in step 312 the frame is not classified as unvoiced speech, the algorithm proceeds to step 316, considering the frame to contain voiced speech. In step 316 the algorithm goes to a half-rate prediction-based coding mode. The algorithm then proceeds to step 318. In step 318 the PSNR is computed. The algorithm then proceeds to step 320.
In step 320 the algorithm determines whether the computed PSNR is greater than a predefined threshold, or target, PSNR value. As an alternative, the threshold, or target, PSNR value may be a function of average bit rate. For example, the average bit rate is calculated periodically and fed back to the algorithm, which adjusts the target threshold value accordingly. Further, it should be understood that any conventional measure of performance may be substituted for PSNR. If the computed PSNR exceeds the target PSNR, the algorithm proceeds to step 322. In step 322 a half-rate coding mode is applied to the frame. The algorithm then proceeds to step 310. If, on the other hand, in step 320 the computed PSNR does not exceed the target PSNR, the algorithm proceeds to step 324. In step 324 the algorithm applies a full-rate coding mode to the frame. The algorithm then proceeds to step 310.
In step 310 the frame number counter is incremented by 1. The algorithm then proceeds to step 326. In step 326 the algorithm determines whether the frame number counter value is greater than or equal to the total number of frames that must be processed (i.e., whether there are any remaining frames to process). If the frame number counter value is less than the total number of frames to be processed, the algorithm returns to step 302, beginning the coding process for the next frame. If, on the other hand, the frame number counter value is greater than or equal to the total number of frames to be processed, the algorithm proceeds to step 328, ending the coding process.
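The per-frame flow of FIG. 4 can be summarized in the following illustrative Python sketch; classify, encode_at_rate, and psnr are placeholders for the open-loop classifier, the actual coding modes, and the performance measure, and the fixed default threshold is an assumption.

    def code_frames(frames, classify, encode_at_rate, psnr, target_psnr_db=30.0):
        """Closed-loop, multimode coding loop after FIG. 4 (illustrative only).
        classify(frame) -> 'N' | 'UV' | 'V'; encode_at_rate(frame, rate) returns the
        decoded frame for that mode; psnr(original, decoded) returns a dB value.
        All three are assumed to be supplied by the surrounding codec."""
        decisions = []
        for frame in frames:                                  # steps 300/302 and 310/326/328
            kind = classify(frame)                            # steps 304/306/312
            if kind == "N":                                   # silence or background noise
                rate = "eighth"                               # step 308
            elif kind == "UV":                                # unvoiced speech
                rate = "quarter"                              # step 314
            else:                                             # voiced speech
                decoded = encode_at_rate(frame, "half")       # steps 316/318: try half rate first
                if psnr(frame, decoded) > target_psnr_db:     # step 320: closed-loop check
                    rate = "half"                             # step 322
                else:
                    rate = "full"                             # step 324: fall back to full rate
            decisions.append(rate)
        return decisions

As noted above, the target threshold itself may instead be derived from the periodically measured average bit rate rather than held fixed.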
In alternate embodiments the full-rate coding mode described above with respect to FIG. 4 could be a higher-bit-rate predictive mechanism (i.e., any bit rate that is greater than half-rate). In one embodiment a higher-bit-rate, direct coding mechanism is substituted for the full-rate, predictive coding mode. The direct coding mode encodes the current speech frame or residue without using any information from the previous frame.
The use of a direct encoding method is appropriate for speech segments for which there is no similarity between the current frame and the previous frame. An example is during the onset of a voice segment. Another example is unvoiced-to-voiced segment transitions. A direct encoding method is also useful in the middle of voiced segments when the cumulative effect of prediction-based encoding has degraded the past quantized frame so as to be too far out of sync with the corresponding original speech frame. In this case predictive coding will fail, even at much higher bit rates, due to the lack of similarity between the past quantized frame and the past original frame. In such a case, a fresh capture of the current frame with a direct encoding method will not only enhance the preservation of the current frame, but will also facilitate future prediction-based encoding of the next and later frames because the prediction mechanism will be aided by a more accurate memory.
Those of skill would understand that while the embodiments described above contemplate four bit rates, any reasonable number of bit rates could be substituted for four. Those of skill would further appreciate that the embodiments described herein could be extended to analysis over a number of frames that is greater than one, at the expense of additional processing time or capability.
In one embodiment two modes may be employed, with bit rates R1 and R2. The R1 coding method is a higher-rate, direct coding method. The R2 coding method is a lower-rate, predictive coding method. A closed-loop decision is performed such that the R2 coding method is tried first, the performance is checked by comparing with a performance measure, and the algorithm switches to the R1 coding method if the performance for the R2 coding mode is insufficient. In an alternate embodiment, the higher-rate, R1 coding mode is tried first, the performance is checked by comparing with a performance measure, and, if the performance is satisfactory, the lower-rate, R2 coding mode is tried. The performance check is then performed for the R2 coding mode, and if the R2 coding mode performance is unsatisfactory, the R1 coding mode is applied to the frame.
In another embodiment multiple coding modes having bit rates R1,R2,...,RN-1,RN (where R1>R2>...>RN-1>RN) are employed. A closed-loop decision is performed such that the lowest rate, RN, is tried first. If the RN coding mode performs adequately, the RN coding mode is retained for the frame. Otherwise, the next, higher-rate coding mode, RN-1, is applied. The process is reiterated until either a coding mode performs adequately or the highest-rate mode, R1, is retained. In an alternate embodiment, the highest rate, R1, is tried first. If the R1 mode performs adequately, the next, lower-rate coding mode, R2, is tried. The process is continued until a given coding mode does not perform adequately (at which time the last coding mode to perform adequately is applied), or until the lowest-rate coding mode, RN, performs satisfactorily and is applied.
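A minimal sketch of the ascending search just described follows; the `modes` list, the `measure` callable, and the single threshold are placeholders for whichever coding modes and performance measure an implementation uses.

```python
def ascending_mode_search(frame, modes, measure, threshold):
    """modes: list of (label, encode) pairs ordered from lowest (RN) to highest (R1)
    bit rate; encode returns the synthesized frame for the tried mode."""
    for label, encode in modes[:-1]:
        synthesized = encode(frame)
        if measure(frame, synthesized) > threshold:
            return label                     # first adequate mode is retained
    # no lower-rate mode was adequate: apply and retain the highest-rate mode R1
    highest_label, highest_encode = modes[-1]
    highest_encode(frame)
    return highest_label
```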
In another embodiment multiple coding modes having bit rates R1,R2,...,Rm-1,Rm,Rm+1,...,RN are employed. The bit rates have the following relative magnitudes: R1>R2>...>Rm-1>Rm>Rm+1>...>RN. A closed-loop mode decision works in conjunction with an open-loop mode decision. The open-loop mode decision, based upon parameters such as frame energy or frame periodicity, tells the coder to apply a mode with a bit rate of Rm, at which point the closed-loop mode decision takes over. The closed-loop mode decision applies the Rm coding mode, tests performance, and maintains the Rm coding mode if performance is satisfactory. Otherwise, the closed-loop mode decision tries the next, higher-rate coding mode, Rm-1. The process is reiterated until either a coding mode performs adequately or the highest-rate mode, R1, is retained. Alternatively, the closed-loop mode decision applies the Rm coding mode, tests performance, and maintains the Rm coding mode if performance is satisfactory. Otherwise, the closed-loop mode decision tries the next, lower-rate coding mode, Rm+1. The process is reiterated until either a coding mode performs inadequately (at which time the last coding mode to perform adequately is applied), or the lowest-rate mode, RN, is retained.
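The escalating variant of this combined decision can be sketched as follows; the open-loop stage is assumed to supply the starting index m (derived elsewhere from frame energy, frame periodicity, or similar parameters), and the descending variant would simply walk the index in the opposite direction.

```python
def combined_open_closed_decision(frame, modes, measure, threshold, start_index):
    """modes: list of encode functions ordered from highest rate (R1, index 0) to
    lowest rate (RN, index N-1); start_index is the open-loop choice Rm."""
    for i in range(start_index, -1, -1):          # escalate from Rm toward R1
        synthesized = modes[i](frame)
        if measure(frame, synthesized) > threshold or i == 0:
            return i                              # adequate mode found, or R1 retained
```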
In another embodiment multiple coding modes having bit rates R1,R2,...,RN (where R1>R2>...>RN) are employed. All of the coding modes are applied in parallel to the input speech frame, and the performances of the coding modes are compared with a set of N threshold performance measures. The coding mode that appears to produce the most accurate result is selected.
In another embodiment multiple coding modes having bit rates R1,R2,...,RN (where R1>R2>...>RN) are employed. All of the coding modes are applied in parallel to the input speech frame, and the performances of the coding modes are compared with a set of N threshold performance measures. If several coding modes exceed the performance threshold target, the coding mode having the lowest bit rate (and also performing above the performance threshold) is selected.
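A sketch of this parallel selection is given below; each mode carries its own threshold (the set of N threshold performance measures), and the fallback to the best-scoring mode when no threshold is met is an assumption of the sketch rather than part of this description.

```python
def parallel_mode_select(frame, modes, measure):
    """modes: list of (bit_rate, encode, threshold) tuples; returns the chosen bit rate."""
    scored = [(bit_rate, measure(frame, encode(frame)), threshold)
              for bit_rate, encode, threshold in modes]
    passing = [(bit_rate, score) for bit_rate, score, threshold in scored
               if score > threshold]
    if passing:
        return min(passing)[0]                    # lowest bit rate above its target
    return max(scored, key=lambda s: s[1])[0]     # fallback: best-performing mode
```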
In another embodiment multiple coding modes having bit rates R1,R2,...,Half Rate,...,Quarter Rate,...,RN (where R1 is Full Rate and RN is Eighth Rate) are employed. A closed-loop mode decision works in conjunction with an open-loop mode decision. The open-loop mode decision, based upon parameters such as frame energy or frame periodicity, tells the coder to apply the full-rate coding mode to unvoiced-to-voiced transition frames, voiced-to-voiced transition frames, nonstationary voiced segments, and nonstationary unvoiced segments. Also based upon frame parameters, the open-loop mode decision tells the coder to apply the half-rate coding mode to steady-voiced segments that exhibit a significant degree of similarity from frame to frame. Also based upon frame parameters, the open-loop mode decision tells the coder to apply the quarter-rate coding mode to steady unvoiced segments. Also based upon frame parameters, the open-loop mode decision tells the coder to apply the eighth-rate coding mode to background noise and other nonspeech signals such as silence. Once the open-loop mode decision has selected a coding mode for application to the frame, the closed-loop mode decision takes over. The closed-loop mode decision applies the coding mode selected by the open-loop mode decision, tests performance, and maintains the selected coding mode if performance is satisfactory. Otherwise, the closed-loop mode decision tries the next, higher-rate coding mode. The process is reiterated until either a coding mode performs adequately or the full-rate mode is retained. Alternatively, the closed-loop mode decision applies the coding mode selected by the open-loop mode decision, tests performance, and maintains the selected coding mode if performance is satisfactory. Otherwise, the closed-loop mode decision tries the next, lower-rate coding mode. The process is reiterated until either a coding mode performs inadequately (at which time the last coding mode to perform adequately is applied), or the lowest-rate mode is retained.
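The open-loop portion of this embodiment amounts to a mapping from frame class to starting rate, as in the sketch below; the class labels and the classifier that produces them are assumptions made for illustration, after which the closed-loop logic sketched earlier takes over.

```python
# Open-loop seed: map a (hypothetical) frame classification to a starting rate.
OPEN_LOOP_RATE = {
    "unvoiced_to_voiced_transition": "full",
    "voiced_to_voiced_transition": "full",
    "nonstationary_voiced": "full",
    "nonstationary_unvoiced": "full",
    "steady_voiced": "half",        # strong frame-to-frame similarity
    "steady_unvoiced": "quarter",
    "silence_or_noise": "eighth",   # background noise and other nonspeech signals
}

def open_loop_rate(frame_class: str) -> str:
    """Return the starting coding rate for a classified frame (default: full rate)."""
    return OPEN_LOOP_RATE.get(frame_class, "full")
```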
In another embodiment a multimode coder includes a first set of N modes, Mi, and the first set of modes has respective bit rates Ri, where i=1,2,...,N. The coder also has a second set of N modes, MCCi, and the second set of modes has respective bit rates RCCi, where i=1,2,...,N. The MCCi and Mi coding modes each use the same source-coding mode (i.e., the same encoder and decoder). However, the MCCi coding mode includes an additional layer of channel protection, in which (RCCi-Ri) bits are used for robust protection of the parameters of the Mi coding mode under the worst possible channel condition of the communication system. Hence, the performance, or voice quality, delivered by the Mi coding mode under channel-error-free conditions is similar to the performance, or voice quality, delivered by the MCCi coding mode under the worst possible channel error condition. The (RCCi-Ri) channel coding bits serve to provide adequate protection under the assumed, or target, worst channel condition. The assumed worst channel condition may advantageously be, e.g., a predefined percentage of frame error rate (FER). In this particular embodiment, a closed-loop mode decision advantageously accounts for both channel variation and source variation to deliver a guaranteed quality of service. For example, a source-controlled, closed-loop mode decision such as described above is applied first. The closed-loop mode decision tells the coder to use the Mi coding mode. An external, network-control indicator SW, which is a signal provided by the communication network to the speech encoder, indicates whether the communication channel is in good condition (e.g., if SW=1, the channel is error-free) or in bad condition (e.g., if SW=0, the channel is erroneous). If the channel is in good condition, the coding mode Mi, having bit rate Ri, is used. If, on the other hand, the channel is in bad condition, the coding mode MCCi, having bit rate RCCi, is used.
Those skilled in the art would appreciate that the number of network conditions need not be restricted to two. Thus, in one embodiment, a multimode coder is designed to account for j=1,2,...,M different possible network conditions by providing M different modes MCCi,j having rates RCCi,j, where j=1,2,...,M, for each original source-controlled coding mode Mi. Such a scheme allows for varied amounts of channel coding because (RCCi,j-RCCi) represents the minimum number of bits needed to add channel error protection to the channel coding layer so that the channel error protection will be adequate for the worst-case scenario in the j-th channel error condition. The source-controlled, closed-loop mode decision then determines which coding mode Mi to apply first, and, based on the value of SW=j (where j=1,2,...,M), selects the coding mode MCCi,j. Such a closed-loop, combined-network-and-source-controlled codec delivers guaranteed quality of service across various channel conditions while also delivering a low average bit rate.
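The final network-and-source-controlled selection can be sketched as below; the convention that SW=0 denotes an error-free channel and SW=j (j=1,...,M) denotes the j-th impaired condition is an assumption adopted only to make the example concrete.

```python
def select_final_mode(i, sw, plain_modes, protected_modes):
    """plain_modes[i] is the source-controlled mode Mi chosen by the closed-loop
    decision; protected_modes[i][j-1] is MCCi,j for channel condition j=1..M.
    sw == 0 is taken here to mean an error-free channel (an assumption)."""
    if sw == 0:
        return plain_modes[i]                  # good channel: use Mi at rate Ri
    return protected_modes[i][sw - 1]          # condition j = sw: use MCCi,j at RCCi,j
```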
Preferred embodiments of the present invention have thus been shown and described. It would be apparent to one of ordinary skill in the art, however, that numerous alterations may be made to the embodiments herein disclosed without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited except in accordance with the following claims. What is claimed is:

Claims

1. A speech coder, comprising: a codec configured to operate in at least one of a plurality of coding modes; and a closed-loop mode decision module coupled to the codec and configured to apply a first coding mode from the plurality of coding modes to an input speech frame, the first coding mode having a first bit rate that is lower than the bit rate of any other coding mode of the plurality of coding modes, the closed-loop mode decision module being further configured to obtain a performance measure of the codec, compare the performance measure with a threshold value, and, if the performance measure does not exceed the threshold value, reject the first coding mode in favor of a second coding mode having a second bit rate that is greater than the first bit rate.
2. The speech coder of claim 1, wherein the closed-loop mode decision module is configured to continue a process of selecting and, based on performance, rejecting coding modes chosen successively in order of increasing bit rate.
3. The speech coder of claim 1, wherein the performance measure is obtained by comparing a resultant synthetic speech frame with the input speech frame.
4. The speech coder of claim 1, wherein the first coding mode is a prediction-based coding mode and the second coding mode is a direct coding mode.
5. The speech coder of claim 1, further comprising an open-loop mode decision module coupled to the codec and configured to select one of the plurality of coding modes for application to the input speech frame before the closed-loop mode decision module applies a coding mode, wherein the closed-loop mode decision module is configured to first apply the coding mode selected by the open-loop mode decision module.
6. The speech coder of claim 2, further comprising an open-loop mode decision module coupled to the codec and configured to select one of the plurality of coding modes for application to the input speech frame before the closed-loop mode decision module applies a coding mode, wherein the closed-loop mode decision module is configured to first apply the coding mode selected by the open-loop mode decision module.
7. The speech coder of claim 1, wherein the threshold value is a predefined quantity.
8. The speech coder of claim 1, wherein the threshold value is a function of average bit rate.
9. A method of coding speech frames, comprising the steps of: selecting a first coding mode to apply to a speech frame, the first coding mode having a first bit rate; obtaining a coding performance measure; comparing the coding performance measure with a threshold value; and rejecting the first coding mode in favor of a second coding mode if the coding performance measure does not exceed the threshold value, the second coding mode having a second bit rate that exceeds the first bit rate.
10. The method of claim 9, further comprising the step of repeating the obtaining, comparing, and rejecting steps in successive order until the coding performance measure exceeds the threshold value.
11. The method of claim 9, wherein the obtaining step comprises comparing a resultant synthetic speech frame with the speech frame.
12. The method of claim 9, wherein the first coding mode is a prediction-based coding mode and the second coding mode is a direct coding mode.
13. The method of claim 9, wherein the selecting step comprises selecting a first coding mode based upon parameters of the speech frame.
14. The method of claim 10, wherein the selecting step comprises selecting a first coding mode based upon parameters of the speech frame.
15. The method of claim 9, wherein the comparing step comprises comparing the coding performance measure with a predefined threshold value.
16. The method of claim 9, wherein the comparing step comprises comparing the coding performance measure with a threshold value that is a function of average bit rate.
17. A speech coder, comprising: means for selecting a first coding mode to apply to a speech frame, the first coding mode having a first bit rate; means for obtaining a coding performance measure; means for comparing the coding performance measure with a threshold value; and means for rejecting the first coding mode in favor of a second coding mode if the coding performance measure does not exceed the threshold value, the second coding mode having a second bit rate that exceeds the first bit rate.
18. The speech coder of claim 17, further comprising means for continuing to obtain the performance measure, compare the performance measure with the threshold value, and reject coding modes in favor of other coding modes having greater bit rates until the coding performance measure exceeds the threshold value.
19. The speech coder of claim 17, wherein the means for obtaining comprises means for comparing a resultant synthetic speech frame with the speech frame.
20. The speech coder of claim 17, wherein the first coding mode is a prediction-based coding mode and the second coding mode is a direct coding mode.
21. The speech coder of claim 17, wherein the means for selecting comprises means for selecting a first coding mode based upon parameters of the speech frame.
22. The speech coder of claim 18, wherein the means for selecting comprises means for selecting a first coding mode based upon parameters of the speech frame.
23. The speech coder of claim 17, wherein the threshold value is a predefined quantity.
24. The speech coder of claim 17, wherein the threshold value is a function of average bit rate.
EP99957560A 1998-11-13 1999-11-12 Closed-loop variable-rate multimode predictive speech coder Withdrawn EP1129451A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US191643 1980-09-26
US19164398A 1998-11-13 1998-11-13
PCT/US1999/026850 WO2000030075A1 (en) 1998-11-13 1999-11-12 Closed-loop variable-rate multimode predictive speech coder

Publications (1)

Publication Number Publication Date
EP1129451A1 true EP1129451A1 (en) 2001-09-05

Family

ID=22706319

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99957560A Withdrawn EP1129451A1 (en) 1998-11-13 1999-11-12 Closed-loop variable-rate multimode predictive speech coder

Country Status (5)

Country Link
EP (1) EP1129451A1 (en)
JP (1) JP2002530706A (en)
KR (1) KR20010087393A (en)
AU (1) AU1524300A (en)
WO (1) WO2000030075A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6330532B1 (en) * 1999-07-19 2001-12-11 Qualcomm Incorporated Method and apparatus for maintaining a target bit rate in a speech coder
US6438518B1 (en) * 1999-10-28 2002-08-20 Qualcomm Incorporated Method and apparatus for using coding scheme selection patterns in a predictive speech coder to reduce sensitivity to frame error conditions
JP3404024B2 (en) 2001-02-27 2003-05-06 三菱電機株式会社 Audio encoding method and audio encoding device
US8532984B2 (en) 2006-07-31 2013-09-10 Qualcomm Incorporated Systems, methods, and apparatus for wideband encoding and decoding of active frames
US8260609B2 (en) 2006-07-31 2012-09-04 Qualcomm Incorporated Systems, methods, and apparatus for wideband encoding and decoding of inactive frames
US8725499B2 (en) 2006-07-31 2014-05-13 Qualcomm Incorporated Systems, methods, and apparatus for signal change detection
JP2008170488A (en) 2007-01-06 2008-07-24 Yamaha Corp Waveform compressing apparatus, waveform decompressing apparatus, program and method for producing compressed data
CN102254562B (en) * 2011-06-29 2013-04-03 北京理工大学 Method for coding variable speed audio frequency switching between adjacent high/low speed coding modes

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0398318A (en) * 1989-09-11 1991-04-23 Fujitsu Ltd Voice coding system
RU2129737C1 (en) * 1994-02-17 1999-04-27 Моторола, Инк. Method for group signal encoding and device which implements said method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0030075A1 *

Also Published As

Publication number Publication date
WO2000030075A1 (en) 2000-05-25
AU1524300A (en) 2000-06-05
KR20010087393A (en) 2001-09-15
JP2002530706A (en) 2002-09-17

Similar Documents

Publication Publication Date Title
EP1340223B1 (en) Method and apparatus for robust speech classification
US7203638B2 (en) Method for interoperation between adaptive multi-rate wideband (AMR-WB) and multi-mode variable bit-rate wideband (VMR-WB) codecs
JP5543405B2 (en) Predictive speech coder using coding scheme patterns to reduce sensitivity to frame errors
EP1129450B1 (en) Low bit-rate coding of unvoiced segments of speech
KR100798668B1 (en) Method and apparatus for coding of unvoiced speech
EP1214705B1 (en) Method and apparatus for maintaining a target bit rate in a speech coder
KR20020081374A (en) Closed-loop multimode mixed-domain linear prediction speech coder
EP1181687B1 (en) Multipulse interpolative coding of transition speech frames
EP1204968B1 (en) Method and apparatus for subsampling phase spectrum information
US6434519B1 (en) Method and apparatus for identifying frequency bands to compute linear phase shifts between frame prototypes in a speech coder
JP2002536694A (en) Method and means for 1/8 rate random number generation for voice coder
WO2000030075A1 (en) Closed-loop variable-rate multimode predictive speech coder

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20010507

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20030603

RBV Designated contracting states (corrected)

Designated state(s): DE FR GB