WO1994012972A1 - Method and apparatus for quantization of harmonic amplitudes - Google Patents

Method and apparatus for quantization of harmonic amplitudes

Info

Publication number
WO1994012972A1
WO1994012972A1 PCT/US1993/011578 US9311578W WO9412972A1 WO 1994012972 A1 WO1994012972 A1 WO 1994012972A1 US 9311578 W US9311578 W US 9311578W WO 9412972 A1 WO9412972 A1 WO 9412972A1
Authority
WO
WIPO (PCT)
Prior art keywords
spectral
frame
generating
harmonics
reconstructed
Prior art date
Application number
PCT/US1993/011578
Other languages
English (en)
Other versions
WO1994012972A9 (fr
Inventor
Jae S. Lim
John C. Hardwick
Original Assignee
Digital Voice Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Voice Systems, Inc. filed Critical Digital Voice Systems, Inc.
Priority to AU56824/94A priority Critical patent/AU5682494A/en
Publication of WO1994012972A1 publication Critical patent/WO1994012972A1/fr
Publication of WO1994012972A9 publication Critical patent/WO1994012972A9/fr

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018 Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/06 Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation

Definitions

  • This invention relates in general to methods for coding speech. It relates specifically to an improved method for quantizing harmonic amplitudes of a parameter representing a segment of sampled speech.
  • vocoders speech coders
  • linear prediction vocoders homomorphic vocoders
  • channel vocoders channel vocoders.
  • speech is modeled on a short-time basis as the response of a linear system excited by a periodic impulse train for voiced sounds or random noise for unvoiced sounds. Speech is analyzed by first segmenting speech using a window such as a Hamming window.
  • the excitation parameters and system parameters are estimated and quantized.
  • the excitation parameters consist of the voiced/unvoiced decision and the pitch period.
  • the system parameters consist of the spectral envelope or the impulse response of the system.
  • the quantized excitation parameters are used to synthesize an excitation signal consisting of a periodic impulse train in voiced regions or random noise in unvoiced regions. This excitation signal is then filtered using the quantized system parameters.
  • MBE Multi-Band Excitation
  • the 4800 bps MBE speech coder used an MBE analysis/synthesis system to estimate the MBE speech model parameters and to synthesize speech from the estimated MBE speech model parameters.
  • a discrete speech signal denoted by s (Fig. 1B)
  • an analog speech signal AS (Fig. 1A)
  • This is typically done at an 8 kHz sampling rate, although other sampling rates can easily be accommodated through a straightforward change in the various system parameters.
  • the system divides the discrete speech signal s into small overlapping segments by multiplying s with a window (such as a Hamming window or a Kaiser window) to obtain a windowed signal segment $s_w(n)$ (Fig. 1B), where n is the segment index.
  • Each speech signal segment is then transformed from the time domain to the frequency domain to generate segment frames $F_w(n)$ (Fig. 1C).
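The segmentation step can be sketched as follows. This is a minimal illustration only: the Hamming window formula is standard, but the frame length of 256 samples, the hop of 160 samples, and the function names are assumptions chosen for the example, not values taken from the text. The time-to-frequency transform that follows windowing is omitted.

```python
import math

def hamming(length):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_segments(s, length=256, hop=160):
    """Split s into overlapping segments and multiply each by the window.

    length and hop are assumed values; hop < length gives the overlap.
    """
    w = hamming(length)
    segments = []
    for start in range(0, len(s) - length + 1, hop):
        segments.append([s[start + n] * w[n] for n in range(length)])
    return segments

# One second of a synthetic 200 Hz tone sampled at 8 kHz (test signal only).
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
frames = windowed_segments(signal)
```

With a hop shorter than the window, consecutive segments share samples, which is the overlap the text describes.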
  • Each frame is analyzed to obtain a set of MBE speech model parameters that characterize that frame.
  • the MBE speech model parameters consist of a fundamental frequency, or equivalently, a pitch period, a set of voiced/unvoiced decisions, a set of spectral amplitudes, and optionally a set of spectral phases.
  • model parameters are then quantized using a fixed number of bits (for instance, digital electromagnetic signals) for each frame.
  • the resulting bits can then be used to reconstruct the speech signal (e.g. an electromagnetic signal), by first reconstructing the MBE model parameters from the bits and then synthesizing the speech from the model parameters.
  • a block diagram of the steps taken to code the spectral amplitudes by a typical MBE speech coder such as disclosed in U.S. Patent No. 5,226,084 is shown in Figure 2.
  • the invention described herein applies to many different speech coding methods, which include but are not limited to linear predictive speech coders, channel vocoders, homomorphic vocoders, sinusoidal transform coders, multi-band excitation speech coders and improved multiband excitation (IMBE) speech coders.
  • IMBE multiband excitation
  • a 7.2 kbps IMBE speech coder is used.
  • This coder uses the robust speech model, referred to above as the Multi-Band Excitation (MBE) model.
  • INMARSAT-M International Marine Satellite Organization
  • Efficient methods for quantizing the MBE model parameters have been developed. These methods are capable of quantizing the model parameters at virtually any bit rate above 2 kbps.
  • the representative 7.2 kbps IMBE speech coder uses a 50 Hz frame rate. Therefore 144 bits are available per frame. Of these 144 bits, 57 bits are reserved for forward error correction and synchronization. The remaining 87 bits per frame are used to quantize the MBE model parameters, which consist of a fundamental frequency $\hat{\omega}_0$, a set of K voiced/unvoiced decisions and a set of L spectral amplitudes $\hat{M}_l$. The values of K and L vary depending on the fundamental frequency of each frame. The 87 available bits are divided among the model parameters as shown in Table 1.
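The per-frame bit budget quoted above follows from simple arithmetic, sketched here. The three constants come from the text; the variable names are illustrative. The split of the 87 bits among the parameters (Table 1) is not reproduced.

```python
BIT_RATE_BPS = 7200      # from the text
FRAME_RATE_HZ = 50       # from the text
FEC_AND_SYNC_BITS = 57   # from the text

# Bits available in each frame, and what is left for the model parameters.
bits_per_frame = BIT_RATE_BPS // FRAME_RATE_HZ
parameter_bits = bits_per_frame - FEC_AND_SYNC_BITS
```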
  • parameters designated with the hat accent (^) are the parameters as determined by the encoder, before they have been quantized or transmitted to the decoder.
  • Parameters designated with the tilde accent (~) are the corresponding parameters that have been reconstructed from the bits to be transmitted, either by the decoder or by the encoder as it anticipatorily mimics the decoder, as explained below.
  • the path from coder to decoder entails quantization of the hat (^) parameter, followed by coding and transmission, followed by decoding and reconstruction.
  • the two parameters can differ due to quantization and also due to bit errors introduced in the coding and transmission process.
  • the coder uses the tilde (~) parameters to anticipate action that the decoder will take. In such instances, the tilde (~) parameters used by the coder have been quantized and reconstructed, but will not have been subject to possible bit errors.
  • the fundamental frequency $\hat{\omega}_0$ is quantized by first converting it to its equivalent pitch period. Estimation of the fundamental frequency is described in detail in U.S. Patent Nos. 5,226,084 and 5,247,579.
  • since $\hat{\omega}_0$ is typically restricted to a range, it is quantized by converting it to a pitch period.
  • a two step estimation method is used, with an initial pitch period (which is related to the fundamental frequency by a specific function
  • the quantity $b_0$ can be represented with eight bits using the following unsigned binary representation:
  • Table 2: Eight Bit Binary Representation. This binary representation is used throughout the encoding and decoding of the IMBE model parameters.
  • L denotes the number of spectral amplitudes in the frequency domain transform of that segment
  • the value of L is derived from the fundamental frequency for that frame, $\hat{\omega}_0$, according to the relationship,
  • where $\lfloor x \rfloor$ in Equation (2) (a "floor" function) is equal to the largest integer less than or equal to x.
  • the L spectral amplitudes are denoted by $\hat{M}_l$ for $1 \le l \le L$, where $\hat{M}_1$ is the lowest frequency spectral amplitude and $\hat{M}_L$ is the highest frequency spectral amplitude.
  • the fundamental frequency is generated in the decoder by decoding and reconstructing the received value, to arrive at $b_0$, from which $\tilde{\omega}_0$ can be generated according to the following:
  • the set of windowed spectral amplitudes for the current speech segment, identified as $s_w(0)$ (with the parenthetical numeral 0 indicating the current segment, -1 indicating the preceding segment, +1 indicating the following segment, etc.), is quantized by first calculating a set of predicted spectral amplitudes based on the spectral amplitudes of the previous speech segment $s_w(-1)$. The predicted results are compared to the actual spectral amplitudes, and the difference for each spectral amplitude, termed a prediction residual, is calculated. The prediction residuals are passed to and used by the decoder.
  • the general method is shown schematically with reference to Fig. 2 and Fig. 3. (The process is recursive from one segment to the next, and also in some respect, between the coder and the decoder. Therefore, the explanation of the process is necessarily a bit circular, and starts midstream.)
  • the vector $\hat{M}(0)$ is a vector of L unquantized spectral amplitudes, which define the spectral envelope of the current sampled window $s_w(0)$.
  • $\hat{M}(0)$ is a vector of twenty-one spectral amplitudes, for the harmonic frequencies that define the shape of the spectral envelope for frame $F_w(0)$.
  • L in this case is twenty-one.
  • $\hat{M}_l(0)$ represents the l-th element in the vector, $1 \le l \le L$.
  • the method includes coding steps 202 (Fig. 2), which take place in a coder, and decoding steps 302 (Fig. 3), which take place in a separate decoder.
  • the coding steps include the steps discussed above, not shown in Fig. 2: i.e., sampling the analog signal AS; applying a window to the sampled signal s to establish a segment $s_w(n)$ of sampled speech; transforming the sampled segment $s_w(n)$ from the time domain into a frame $F_w(n)$ in the frequency domain; and identifying the MBE speech model parameters that define that segment, i.e.: fundamental frequency $\hat{\omega}_0$; spectral amplitudes $\hat{M}(0)$ (also known as samples of the spectral envelope); and voiced/unvoiced decisions.
  • the method uses post transmission prediction in the decoding steps conducted by the decoder as well as differential coding.
  • a packet of data is sent from the coder to the decoder representing model parameters of each spectral segment.
  • the coder does not transmit codes representing the actual full value of the parameter (except for the first frame). This is because the decoder makes a rough prediction of what the parameter will be for the current frame, based on what the decoder determined the parameter to be for the previous frame (based in turn on a combination of what the decoder previously received, and what it had determined for the frame preceding the preceding segment, and so on).
  • the coder only codes and sends the difference between what the decoder will predict and the actual values. These differences are referred to as "prediction residuals." This vector of differences will, in general, require fewer bits for coding than would the coding of the absolute parameters.
  • the values that the decoder generates as output from summation 316 are the logarithm base 2 of what are referred to as the quantized spectral amplitudes, designated by the vector $\tilde{M}$. This is distinguished from the vector of unquantized values $\hat{M}$, which is the input to the log2 block 204 in the coder. (The prediction steps are grouped in the dashed box 340.)
  • the decoder stores the vector $\log_2 \tilde{M}(-1)$ for the previous segment by taking a frame delay 312.
  • the decoder computes the predicted spectral log amplitudes according to a method discussed below.
  • because the coder cannot communicate with the decoder to anticipate the prediction to be made by the decoder, the coder must also make the prediction, as closely as possible to the manner in which the decoder will make the prediction.
  • the prediction in the decoder is based on the values $\log_2 \tilde{M}(-1)$ for the previous segment generated by the decoder. Therefore, the coder must also generate these values, as if it were the decoder, as discussed below, so that it anticipatorily mirrors the steps that will be taken by the decoder.
  • if the coder accurately anticipates the prediction that the decoder will make with respect to the spectral log amplitudes $\log_2 \tilde{M}(0)$, the values b to be transmitted by the encoder will reflect the difference between the prediction and the actual values $\log_2 \hat{M}(0)$.
  • in the decoder at 316, upon addition, the result is $\log_2 \tilde{M}(0)$, a quantized version of the actual values $\log_2 \hat{M}(0)$.
  • the coder, during the simulation of the decoder steps at 240, conducts steps that correspond to the steps that will be performed by the decoder, in order for the coder to anticipatorily mirror how the decoder will predict the values for $\log_2 \tilde{M}(0)$ based on the previous computed values $\log_2 \tilde{M}(-1)$. In other words, the coder conducts steps 240 that mimic the steps conducted by the decoder.
  • the coder has previously produced the actual values M(0).
  • the logarithm base two of this vector is taken at 204.
  • the coder subtracts from this logarithm vector a vector of the predicted spectral log amplitudes calculated at step 214.
  • the coder uses the same steps for computing the predicted values as will the decoder, and uses the same inputs as will the decoder: $\tilde{\omega}_0(0)$ and $\tilde{\omega}_0(-1)$, which are the reconstructed fundamental frequencies, and $\log_2 \tilde{M}(-1)$. It will be recalled that $\log_2 \tilde{M}(-1)$ is the value that the decoder has computed for the previous frame (after the decoder has performed its rough prediction and then adjusted the prediction with addition of the prediction residual values transmitted by the coder).
  • the coder generates $\log_2 \tilde{M}(-1)$ by performing the exact steps that the decoder performed to generate $\log_2 \tilde{M}(-1)$.
  • the coder had sent to the decoder a vector $b_l(-1)$, where $2 \le l \le L + 3$. (The generation of the vector $b_l$ is discussed below.)
  • the coder reconstructs the values of the vector $b_l(-1)$ into DCT coefficients as the decoder will do.
  • An inverse DCT transform (or inverse of whatever suitable transform is used in the forward transformation part of the coder at step 206) is performed at 220, and reformation into blocks is conducted at 222.
  • the coder will have produced the same vector as the decoder produces at the output of reformation step 322.
  • this is added to the predicted spectral log amplitudes for the previous frame $F_w(-2)$, to arrive at the output from the decoder, $\log_2 \tilde{M}(-1)$.
  • the result of the summation in the coder at 226, $\log_2 \tilde{M}(-1)$, is stored by implementing a frame delay 212, after which it is used as discussed above to simulate the decoder's prediction of $\log_2 \tilde{M}(0)$.
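The way the coder mirrors the decoder can be illustrated with a deliberately simplified sketch. It tracks a single log amplitude per frame rather than a full vector, and it replaces the block/DCT quantizer with a hypothetical uniform scalar quantizer (`STEP`), so it is an illustration of the closed-loop idea under those assumptions, not the quantizer of the patent. The initial state of 0.0 is also an assumption.

```python
import math

GAMMA = 0.7   # scaling applied to the previous value; .7 is the typical value in the text
STEP = 0.25   # hypothetical uniform quantizer step (assumption)

def quantize(residual):
    """Map a residual to an integer code (stand-in for the DCT quantizer)."""
    return round(residual / STEP)

def reconstruct(code, prev_log_amp):
    """Decoder step: scaled previous value plus the dequantized residual."""
    return GAMMA * prev_log_amp + code * STEP

def encode(amplitudes):
    """Encoder: code the residual against the decoder's expected prediction."""
    codes, prev = [], 0.0           # prev mirrors the decoder's stored state
    for m in amplitudes:
        residual = math.log2(m) - GAMMA * prev
        code = quantize(residual)
        codes.append(code)
        prev = reconstruct(code, prev)   # same step the decoder will run
    return codes, prev

def decode(codes):
    """Decoder: rebuild each frame from its code and the previous frame."""
    prev, out = 0.0, []
    for code in codes:
        prev = reconstruct(code, prev)
        out.append(prev)
    return out

codes, encoder_state = encode([8.0, 16.0, 12.0, 4.0])
decoded = decode(codes)
```

Because the encoder updates its state from the quantized code rather than from the unquantized amplitude, its stored value matches the decoder's as long as no bit errors occur, and quantization error does not accumulate from frame to frame.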
  • the vector $b_l$ is generated in the coder as follows.
  • the coder subtracts the vector of predicted values that the coder calculates the decoder will predict from the actual values of $\log_2 \hat{M}(0)$ to produce a vector $\hat{T}$.
  • this vector is divided into blocks, for instance six, and at 206 a transform, such as a DCT, is performed. Other sorts of transforms, such as the Discrete Fourier Transform, may also be used.
  • the output of the DCT transform is organized in two groups: a set of D.C. values, associated into a vector referred to as the Prediction Residual Block Average (PRBA); and the remaining, higher order coefficients, both of which are quantized at 208 and are designated as the vector $b_l$.
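A minimal sketch of the block division and transform follows. The block-length rule (contiguous blocks of near-equal length, longer blocks last) and the 1/J scaling of the DCT are assumptions made for the example; the text defers the exact method to the cited U.S. patents. With this scaling the zeroth coefficient of each block equals the block average, which is what makes the set of D.C. values a "block average" vector. The sketch assumes there are at least as many residuals as blocks.

```python
import math

def split_into_blocks(values, num_blocks=6):
    """Split into contiguous blocks of near-equal length (assumed rule)."""
    base, extra = divmod(len(values), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The last `extra` blocks take one additional element (assumption).
        size = base + (1 if i >= num_blocks - extra else 0)
        blocks.append(values[start:start + size])
        start += size
    return blocks

def dct(block):
    """DCT-II scaled by 1/J so that coefficient 0 is the block mean."""
    j_len = len(block)
    return [sum(x * math.cos(math.pi * k * (j + 0.5) / j_len)
                for j, x in enumerate(block)) / j_len
            for k in range(j_len)]

residuals = [0.1 * l for l in range(1, 22)]      # 21 made-up residuals
coeffs = [dct(b) for b in split_into_blocks(residuals)]
prba = [c[0] for c in coeffs]                    # D.C. value of each block
higher_order = [c[1:] for c in coeffs]           # remaining coefficients
```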
  • PRBA Prediction Residual Block Average
  • This quantization method provides very good fidelity using a small number of bits and it maintains this fidelity as L varies over its range.
  • the computational requirements of this approach are well within the limits required for real-time implementation using a single DSP such as the DSP32C available from AT&T.
  • This quantization method separates the spectral amplitudes into a few components, such as the mean of the PRBA vector, that are sensitive to bit errors and a large number of other components that are not very sensitive to bit errors. Forward error correction can then be used in an efficient manner by providing a high degree of protection for the few sensitive components and a lesser degree of protection for the remaining components.
  • $\hat{M}_l(0)$ denotes the spectral amplitudes of the current speech segment
  • $\tilde{M}_l(-1)$ denotes the quantized spectral amplitudes of the previous speech segment.
  • the constant $\gamma$, a decay factor, is typically equal to .7; however, any value in the range $0 \le \gamma \le 1$ can be used. The effect and purpose of the constant $\gamma$ are explained below. For instance, as shown in Fig. 1C, L(0) is 21 and L(-1) is 7.
  • the fundamental frequency of the current frame $F_w(0)$ is one third of the fundamental frequency of the previous frame $F_w(-1)$.
  • each harmonic amplitude can be identified by an index number representing its position along the frequency axis. For instance, for the example set forth above, according to the rudimentary method, the value for the first of the harmonic amplitudes in the current frame would be predicted to be equal to the value of the first harmonic amplitude in the previous frame. Similarly, the value of the fourth harmonic amplitude would be predicted to be equal to the value of the fourth harmonic amplitude in the previous frame.
  • the fourth harmonic amplitude in the current frame is closer in value to an interpolation between the amplitudes of the first and second harmonics of the previous frame, rather than to the value of the fourth harmonic.
  • the eighth through twenty-first harmonic amplitudes of the current frame would all have the value of the last, L(-1)th or seventh, harmonic amplitude of the previous frame.
  • $k_l$ represents a relative index number. If the ratio of the current to the previous fundamental frequencies is 1/3, as in the example, $k_l$ is equal to $\frac{1}{3} \cdot l$ for each index number $l$.
  • a predicted spectral log amplitude for the l-th harmonic of the current frame can be expressed as:
  • the predicted value is interpolated between two actual values of the previous frame.
  • the predicted value is a sort of weighted average between the two harmonic amplitudes of the previous frame closest in frequency to the harmonic amplitude in question of the current frame.
  • this predicted value is the value that the decoder will predict for the log amplitude of the harmonic frequencies that define the spectral envelope for the current frame.
  • the coder also generates this prediction value in anticipation of the decoder's prediction, and then calculates a prediction residual vector, $\hat{T}_l$, essentially equal to the difference between the actual value the coder has generated and the predicted value that the coder has calculated that the decoder will generate:
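The interpolated prediction can be sketched as below. The text describes it as a weighted average of the two previous-frame harmonics closest in frequency, scaled by the decay factor; the linear weighting by the fractional part of the relative index, the clamping at both ends of the previous frame, and all names are assumptions made for this illustration.

```python
import math

GAMMA = 0.7  # decay factor; typical value given in the text

def predict_log_amplitudes(prev_log_amps, num_current, freq_ratio):
    """Predict log amplitudes for harmonics 1..num_current.

    prev_log_amps[i] holds the previous frame's log amplitude for harmonic i+1.
    freq_ratio is current / previous fundamental frequency.
    """
    last = len(prev_log_amps)

    def prev(index):
        # Clamp to the available harmonics (boundary rule is an assumption).
        return prev_log_amps[min(max(index, 1), last) - 1]

    predictions = []
    for l in range(1, num_current + 1):
        k = freq_ratio * l              # relative index for harmonic l
        lower = math.floor(k)
        frac = k - lower                # weight of the upper neighbour
        predictions.append(
            GAMMA * ((1 - frac) * prev(lower) + frac * prev(lower + 1)))
    return predictions

previous = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # made-up values, L(-1) = 7
predicted = predict_log_amplitudes(previous, 21, 1 / 3)
residual_for_harmonic_4 = 1.5 - predicted[3]      # 1.5 is a made-up actual value
```

For the fourth harmonic the relative index is 4/3, so the prediction falls between the first and second harmonics of the previous frame, as in the example in the text.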
  • the improved method results are identical to the rudimentary method. In other cases the improved method produces a prediction residual with lower variance than the former method. This allows the prediction residuals to be quantized with less distortion for a given number of bits.
  • the coder does not transmit absolute values from the coder to the decoder. Rather, the coder transmits a differential value, calculated to be the difference between the current value, and a prediction of the current value made on the basis of previous values.
  • the differential value that is received by the decoder can be erroneous, either due to computation errors or bit transmission errors. If so, the error will be incorporated into the current reconstructed frame, and will further be perpetuated into subsequent frames, since the decoder makes a prediction for the next frame based on the previous frame. Thus, the erroneous prediction will be used as a basis for the reconstruction of the next segment.
  • the encoder does include a mirror, or duplicate of the portion of the decoder that makes the prediction.
  • the inputs to the duplicate are not values that may have been corrupted during transmission, since such errors arise unexpectedly in transmission and cannot be duplicated. Therefore, differences can arise between the predictions made by the decoder, and the mirroring predictions made in the encoder. These differences detract from the quality of the coding scheme.
  • the factor $\gamma$ causes any such error to "decay" away after a number of future segments, so that any errors are not perpetuated indefinitely.
  • This is shown schematically in Fig. 4.
  • Sub panels A and B of Fig. 4 show the effect of a transmitted error with no factor $\gamma$ (which is the same as $\gamma$ equal to 1).
  • the amplitude of a single spectral harmonic is shown for the current frame x(0), and the five preceding frames x(-1), x(-2), etc.
  • the vertical axis represents amplitude and the horizontal axis represents time.
  • the differential values sent are indicated below the amplitude, which is recreated by adding the differential value to the previous value.
  • Panel A shows the situation if the correct differential values are sent. The reconstructed values equal the original values.
  • Panel B shows the situation if an incorrect value is transmitted, for instance ⁇ (-
  • Panel C shows the situation if a factor $\gamma$ is used.
  • the differential that will be sent is no longer the simple difference, but rather:
  • the differential value sent for frame -3 equals +12.5, etc. If no error corrupts the values sent, the reconstructed values (boxes) are the same as the original, as shown in panel C. However, if an error, such as a bit error, corrupts the differential values sent, such as sending +40 rather than +12.5 for frame -3, the effect of the error is minimized, and decays with time.
  • the errant value is reconstructed as 47.5 rather than the 50 that would be the case with no decay factor.
  • the next value, which should be zero, is reconstructed as 20.63, rather than as 30 in the case where no decay factor $\gamma$ is used.
  • the next value, also properly equal to zero is reconstructed as 15.47, which, although incorrect, is closer to being correct than the 30 that would again be calculated without the decay factor. The next calculated value is even closer to being correct, and so on.
  • the decay factor can be any number between zero and one. If a smaller factor, such as .5 is used, the error will decay away faster. However, less of a coding advantage will be gained from the differential coding, because the differential is necessarily increased. The reason for using differential coding is to obtain an advantage when the frame-to-frame difference, as compared to the absolute value, is small. In such a case, there is a significant coding advantage for differential coding. Decreasing the value of the decay factor increases the differences between the predicted and the actual values, which means more bits must be used to achieve the same quantization accuracy.
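The numbers quoted for Fig. 4 can be reproduced with a short simulation. The sketch assumes original amplitudes of 10, 20, 0, 0 for frames -4 through -1 and a decay factor of 0.75; neither value is stated in the text, but together they yield the quoted differential of +12.5 and the quoted reconstructions 47.5, 20.63 and 15.47. The function names are illustrative.

```python
def make_differentials(values, gamma):
    """Differentials for an error-free channel: x(n) - gamma * x(n-1)."""
    prev, out = 0.0, []
    for x in values:
        out.append(x - gamma * prev)
        prev = x          # without errors the reconstruction equals x
    return out

def reconstruct(differentials, gamma):
    """Decoder: scaled previous reconstruction plus the received differential."""
    prev, out = 0.0, []
    for d in differentials:
        prev = gamma * prev + d
        out.append(prev)
    return out

original = [10.0, 20.0, 0.0, 0.0]   # assumed x(-4) .. x(-1)

def corrupted_run(gamma):
    sent = make_differentials(original, gamma)
    sent[1] = 40.0                  # bit error in the value sent for frame -3
    return reconstruct(sent, gamma)

no_decay = corrupted_run(1.0)       # error persists at a constant offset
with_decay = corrupted_run(0.75)    # error shrinks by 0.75 each frame
```

With a factor of 1 the offset of 30 never shrinks; with 0.75 the error is multiplied by 0.75 in every later frame, which is the decay the text describes.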
  • the prediction residuals $\hat{T}_l$ are divided into blocks.
  • a preferred method for dividing the residuals into blocks and then generating DCT coefficients is disclosed fully in U.S. Patent Nos. 5,226,084 and 5,247,579.
  • the binary representation can be transmitted, stored, etc., depending on the application.
  • the spectral log amplitudes can be reconstructed from the binary representation by first reconstructing the quantized DCT coefficients for each block, performing the inverse DCT on each block, and then combining with the quantized spectral log amplitudes of the previous segment using the inverse of Equation (7).
  • Error correction codes allow infrequent bit errors to be corrected, and they allow the system to estimate the error rate. The estimate of the error rate can then be used to adaptively process the model parameters to reduce the effect of any remaining bit errors.
  • the quantized speech model parameter bits are divided into three or more different groups according to their sensitivity to bit errors, and then different error correction or detection codes are used for each group.
  • the group of data bits which is determined to be most sensitive to bit errors is protected using very effective error correction codes.
  • Less effective error correction or detection codes, which require fewer additional bits, are used to protect the less sensitive data bits.
  • This method allows the amount of error correction or detection given to each group to be matched to its sensitivity to bit errors. The degradation caused by bit errors is relatively low, as is the number of bits required for forward error correction.
  • the choice of error correction or detection codes that is used depends upon the bit error statistics of the transmission or storage medium and the desired bit rate.
  • the most sensitive group of bits is typically protected with an effective error correction code such as a Hamming code, a BCH code, a Golay code or a Reed-Solomon code.
  • the error correction and detection codes used herein are well suited to a 6.4 kbps IMBE speech coder for satellite communications.
  • the bits per frame which are reserved for forward error correction are divided among [23,12] Golay codes which can correct up to 3 errors, [15,11] Hamming codes which can correct single errors and parity bits.
  • the six most significant bits from the fundamental frequency and the three most significant bits from the mean of the PRBA vector are first combined with three parity check bits and then encoded in a [23,12] Golay code. Thus, all of the six most significant bits are protected against bit errors.
  • a second Golay code is used to encode the three most significant bits from the PRBA vector and the nine most sensitive bits from the higher order DCT coefficients. All of the remaining bits except the seven least sensitive bits are then encoded into five [15,11] Hamming codes. The seven least significant bits are not protected with error correction codes.
  • the received bits are passed through Golay and Hamming decoders, which attempt to remove any bit errors from the data bits.
  • the three parity check bits are checked and if no uncorrectable bit errors are detected then the received bits are used to reconstruct the MBE model parameters for the current frame. Otherwise if an uncorrectable bit error is detected then the received bits for the current frame are ignored and the model parameters from the previous frame are repeated for the current frame.
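The bit counts implied by this allocation can be tallied as follows. The code parameters and bit counts come from the text; the variable names are illustrative. The total works out to 128 bits per frame, which would correspond to 6400 bps if the 50 Hz frame rate of the 7.2 kbps example is assumed to apply here as well (that frame rate is an assumption for this coder).

```python
GOLAY_N, GOLAY_K = 23, 12        # [23,12] Golay code
HAMMING_N, HAMMING_K = 15, 11    # [15,11] Hamming code

first_golay_data = 6 + 3      # fundamental-frequency MSBs + PRBA-mean MSBs
parity_check_bits = 3         # fill the first Golay input up to 12 bits
second_golay_data = 3 + 9     # PRBA MSBs + sensitive higher-order DCT bits
hamming_blocks = 5
unprotected_bits = 7

# Model-parameter bits carried in one frame.
data_bits = (first_golay_data + second_golay_data
             + hamming_blocks * HAMMING_K + unprotected_bits)

# Bits added for error detection and correction.
redundancy_bits = (parity_check_bits
                   + 2 * (GOLAY_N - GOLAY_K)
                   + hamming_blocks * (HAMMING_N - HAMMING_K))

total_bits = data_bits + redundancy_bits
```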
  • the known method uses the previous frame to predict the current frame (which is essentially a differential sort of prediction)
  • the predictions for the current frame will be based on the general location of the curve, or the distance from the origin. Since the current frame does not necessarily share the general location of the curve with its predecessor, the difference between the prediction and the actual value for the spectral amplitudes of the current frame can be quite large. Further, because the system is basically a differential coding system, as explained above, differential errors take a relatively long time to decay away. Since it is an object of the prediction method to minimize the prediction residuals, this effect is undesirable.
  • the invention is for use with either the decoding portion or the encoding portion of a speech coding and decoding pair of apparatuses.
  • the coder/decoder are operated according to a method by which a timewise segment of an acoustic speech signal is represented by a frame of a data signal characterized at least in part by a fundamental frequency and a plurality of spectral harmonics.
  • the current frame is reconstructed by the decoding apparatus, which reconstructs signal parameters characterizing a frame, using a set of prediction signals.
  • the prediction signals are based on: reconstructed signal parameters characterizing the preceding frame and a pair of parameters that specify the number of spectral harmonics for the current frame and the preceding frame.
  • One aspect of the invention is a method of generating the prediction signals.
  • the parameters upon which the prediction is based are reconstructed from digitally encoded signals that have been generated using error protection for all of the digital bits used to encode each of the spectral harmonic parameters.
  • the parameter can be the number of spectral harmonics in the frame, derived from the six most significant bits of the fundamental frequency.
  • Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals based on parameters that are highly protected against bit errors.
  • Yet another aspect of the invention is an apparatus for generating prediction signals, including means for performing the steps mentioned above.
  • Yet another aspect of the invention is a method for generating such prediction signals, further including the step of scaling the amplitudes of each of the set of prediction signals by a factor that is a function of the number of spectral harmonics for the current frame, having at least one domain interval that is an increasing function of the number of spectral harmonics.
  • This aspect of the invention has the advantage that bit errors introduced into the spectral amplitudes as transmitted decay away over time. Further, for speech segments where extra bits are available the effect of bit errors decays away more quickly with no loss in coding efficiency.
  • Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals where bit errors decay away over time, more quickly where extra bits are available.
  • Yet another aspect of the invention is an apparatus for generating prediction signals, including means for providing for the decay of bit errors mentioned above.
  • Yet another aspect of the invention is a method for generating such prediction signals, further including the step of reducing the amplitudes of each of the set of predictions by the average amplitude of all of the prediction signals, averaged over the current frame.
  • This aspect of the invention has the advantage that the effect of bit errors related to the average value of the spectral amplitudes for a particular frame, introduced into the prediction signals, is limited to the reconstruction of only one frame.
  • Another aspect of the invention is the method of generating a coded signal representing the spectral harmonics, using the method of generating prediction signals where bit errors related to the average value of the spectral amplitudes for a particular frame introduced into the prediction signals are limited to the reconstruction of only one frame.
  • Yet another aspect of the invention is an apparatus for generating prediction signals, including means for protecting against the persistence of bit errors related to the average value of the spectral amplitudes for a particular frame.
  • Figure 1A is a schematic representation of an analog speech signal, showing overlapping frames.
  • Figure 1B is a schematic representation of two frames of sampled speech signal.
  • Figure 1C is a schematic representation of the spectral amplitudes that contribute to two successive frames of sampled speech.
  • Figure 2 is a flow chart showing the steps that are taken to encode spectral amplitudes according to a method that uses prediction of spectral amplitudes in the decoder, along with the transmission of prediction residuals by the encoder.
  • Figure 3 is a flow chart showing the steps that are taken to decode spectral amplitudes encoded according to the method shown in Fig. 2.
  • Figure 4 is a schematic representation showing the effect of a method to minimize over time the effects of bit errors in a differential coding system.
  • Figure 5 is a flow chart showing the steps that are taken to encode spectral amplitudes according to a version of the method of the invention that uses prediction of spectral amplitudes in the decoder, along with the transmission of prediction residuals by the encoder, with the prediction values made on the basis of the number of spectral harmonics.
  • Figure 6 is a flow chart showing the steps that are taken to decode speech encoded according to a version of the method shown in Fig. 5.
  • Figure 7 is a flow chart showing schematically an overview of the steps that are taken according to a version of the method of the invention to make predictions of the spectral amplitudes of the current frame in the decoder.
  • Figure 8 is a flow chart showing schematically an overview of the steps that are taken according to a version of the method of the invention to make predictions of the spectral amplitudes of the current frame in the encoder.
  • Figure 9 is a flow chart showing schematically the steps that are taken according to a version of the method of the invention to generate a variable decay factor that depends on the amount of information that must be encoded in a frame.
  • Figure 10A is a schematic block diagram showing an embodiment of the coding apparatus of the invention.
  • Figure 10B is a schematic block diagram showing an embodiment of the decoding apparatus of the invention.
  • Figures 11A and 11B show schematically, in flow chart form, the steps of a version of the method of the invention for predicting the spectral log amplitudes of the current frame.
  • The present invention provides improved methods of predicting the spectral amplitudes M_l(0). These prediction method steps are conducted as part of the decoding steps, shown in Fig. 3, for instance as part of step 314. Because the coder also mirrors the decoder and conducts some of the same steps, the prediction method steps of the invention will also be conducted as part of the coding steps, for instance at step 214 shown in Fig. 2.
  • A representative version of the coding apparatus 1000 of the invention is shown schematically by Fig. 10A.
  • A speech signal is received by an input device 1002 such as a microphone, which is connected as an input device to a digital signal processor under appropriate computer instruction control, such as the DSP32C available from AT&T.
  • The DSP 1004 conducts all of the data processing steps specified in the discussion of the method of the invention below. In general, it digitizes the speech signal and codes it for transmission or other treatment in a compact form. Rather than a DSP, a suitably programmed general purpose computer can be used, although such computing power is not necessary, and the additional elements of such a computer typically result in lower performance than with a DSP.
  • A display 1008 provides the user with information about the status of the signal processing and may provide visual confirmation of the commands entered by input device 1006.
  • An output device 1010 directs the coded speech data signal as desired, typically to a communication channel, such as a radio frequency channel, telephone line, etc.
  • The coded speech data signal can be stored in memory device 1011 for later transmission or other treatment.
  • A representative version of the decoding apparatus 1030 of the invention is shown schematically by Fig. 10B.
  • The coded speech data is provided by input device 1020, typically a receiver, which receives the data from a data communication channel. Alternatively, the coded speech data can be accessed from memory device 1021.
  • DSP 1014, which can be identical to DSP 1004, is programmed according to the method steps of the invention discussed below to decode the incoming speech data signal and to generate a synthesized digital speech signal, which is provided to output controller 1022.
  • A display 1018, such as a liquid crystal display, may be provided to display to the user the status of the operation of the DSP 1014, and also to provide the user with confirmation of any commands the user may have specified through input device 1016, again, typically a keypad.
  • The synthesized speech signal may be placed into memory 1021 (although this is typically not done after the coded speech data has been decoded) or it may be provided to a reproduction device, such as loudspeaker 1012.
  • The output of the loudspeaker is an acoustic signal of synthesized speech, corresponding to the speech provided to microphone 1002.
  • The prediction steps performed in the digital signal processors 1004 and 1014 of the coder 1000 and the decoder 1030 do not use as inputs the transmitted and reconstructed fundamental frequencies ω(0) and ω(-1), as is done in the method discussed in U.S. Patent Nos. 5,226,084 and 5,247,579.
  • The fundamental frequencies ω are specified using eight bits.
  • Only the most significant six bits are absolutely protected by virtue of the error protection method. Therefore, undetected, uncorrected errors can be present in the fundamental frequencies as reconstructed by the decoder.
  • The errors that may arise in the two least significant bits can create a large difference in the identification of higher harmonic spectral locations, and thus their amplitudes.
  • The method of the invention uses, as an input to the steps 214 and 314 of predicting the spectral log amplitudes, the number of spectral harmonics, L(0) and L(-1).
  • An aspect of the invention is the realization that the number of spectral amplitudes L(0) and L(-1) can be used to generate the indices used in the interpolation step of the prediction.
  • Another aspect of the invention is the realization that these parameters can be derived from the highly protected most significant six bits specifying the fundamental frequency.
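  • The protection argument can be illustrated with a minimal sketch. It assumes only what the text states: an eight-bit fundamental frequency code whose six most significant bits are protected. The function names and the sample code word are hypothetical, and the mapping from the protected bits to the harmonic count L is deliberately not shown.

```python
def protected_bits(f0_code: int) -> int:
    """Return the six most significant bits of an 8-bit fundamental-frequency code."""
    if not 0 <= f0_code <= 0xFF:
        raise ValueError("expected an 8-bit value")
    return f0_code >> 2


def flip_bit(value: int, bit: int) -> int:
    """Simulate a single channel bit error at the given bit position (0 = LSB)."""
    return value ^ (1 << bit)


# Hypothetical transmitted code word, used for illustration only.
sent = 0b10110110

# Errors confined to the two least significant bits leave the protected field intact,
# so any quantity derived only from that field is the same in coder and decoder.
lsb_errors = [flip_bit(sent, b) for b in (0, 1)]
unchanged = all(protected_bits(r) == protected_bits(sent) for r in lsb_errors)
```

  • In this sketch a value derived only from `protected_bits` is immune to errors in bits 0 and 1, which is the property the prediction relies on when it takes L(0) and L(-1) as inputs instead of the full fundamental frequency.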
  • The steps of a version of the method of the invention for predicting the spectral log amplitudes are shown schematically in flow chart form in Figs. 11A and 11B. As explained below, these steps are conducted in both the coder and the decoder. The only difference is whether the starting values have been transmitted over a channel (in the decoder) or not (in the coder).
  • The method begins at 1102, followed by getting the fundamental frequency ω(0) for the current frame at 1104.
  • The number of spectral harmonics is computed at step 1106 by the decoder according to the following:
  • Because L is highly protected, there is a much higher probability that the value of L reconstructed by the decoder equals the value of L used by the coder, and thus there will be a much lower probability of deviation between the predicted values as generated by the decoder following the decoding steps 1302, and the anticipated predicted values as generated by the coder following the steps 1240 as it mirrors the decoding steps.
  • k is a modified index, modified based on the ratio of the number of harmonic amplitudes in the previous segment relative to the current segment. It is also useful to define at step 1108:
  • The predicted value for the spectral log amplitudes, M_l^p, is generated at 1112 in the decoder step 1314 according to the following:
  • The b and d terms represent the log amplitudes of the spectral harmonics of the previous segment that bracket the frequency of the spectral harmonic of the current frame, selected by virtue of the floor function subscript, which is based on equations 12 and 13. (A vector containing these values has been obtained from a memory at 1110.)
  • The a and c terms represent the weights to be given to these amplitudes, depending on their distance (along the frequency axis) from the frequency calculated to be the frequency of the l-th harmonic.
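  • The interpolation described above can be sketched as follows. This is an illustration under stated assumptions, not the equations of the specification: the modified index is assumed to be k = l · L(-1) / L(0), the weights a and c are assumed to be linear in the fractional part of k, and indices outside 1..L(-1) are simply clamped. The function name and argument names are hypothetical.

```python
import math


def predict_log_amplitudes(prev_log_amps, num_curr, decay=1.0):
    """Predict current-frame log amplitudes from the previous frame.

    prev_log_amps[l - 1] holds the reconstructed log amplitude of harmonic l
    (l = 1..L(-1)) of the previous frame; num_curr is L(0).
    """
    num_prev = len(prev_log_amps)

    def prev(l):
        # Assumed boundary handling: clamp the index to the range 1..L(-1).
        l = min(max(l, 1), num_prev)
        return prev_log_amps[l - 1]

    predictions = []
    for l in range(1, num_curr + 1):
        k = l * num_prev / num_curr    # modified index (assumed form)
        lo = math.floor(k)             # floor-function subscript
        c = k - lo                     # weight of the upper bracketing harmonic
        a = 1.0 - c                    # weight of the lower bracketing harmonic
        b, d = prev(lo), prev(lo + 1)  # bracketing log amplitudes
        predictions.append(decay * (a * b + c * d))
    return predictions


# Two harmonics in the previous frame stretched onto four in the current frame.
stretched = predict_log_amplitudes([2.0, 4.0], 4)
```

  • Note that the sketch takes only the two harmonic counts and the previous frame's log amplitudes as inputs; the fundamental frequency itself never appears, which mirrors the design choice described in the text.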
  • The decay factor γ may be as above, although in another aspect of the invention, the decay factor is further enhanced, as discussed below. If no decay factor is used, the method proceeds from 1114 to the end at 1128, having established the vector of predicted spectral log amplitudes for the current frame.
  • A decay factor is used, and it is determined at 1116, as discussed below in connection with Fig. 9.
  • The reconstructed spectral log amplitudes M_l(0) are computed at 1316, generally as discussed above, by adding the predicted spectral log amplitudes to the reconstructed prediction residuals T_l, as follows:
  • The predicted values are more accurate as compared to those predicted according to the method of U.S. Patent No. 5,226,084, because the parameters used to compute them are more highly protected against bit errors. Further, the predicted values are more closely mimicked, or anticipated, by the mirror steps in the coder at step 1214, because the parameters upon which both coding step 1214 and decoding step 1314 are based are more likely to be equal. In order to reconstruct log2 M_l(0) using equation
  • M_l(-1) = M_{L(-1)}(-1) for l > L(-1)
  • Figs. 11A and 11B that the coder will take.
  • The coder computes the predicted spectral log amplitudes M_l^p using as inputs L(0) and L(-1).
  • The coder at 1214 mimics all of the steps that the decoder will conduct at 1314, as shown in Fig. 8. All of the steps that the coder uses at 1214 use the reconstructed variables that should be used by the decoder. Because all such variables are ultimately based on the highly protected L, there is a high probability that the values for the parameters generated by the coding steps in the coder are the same as the values for the parameters generated by the decoding steps in the decoder.
  • The L terms can differ between the coder and the decoder. Such a difference could, eventually, degrade signal quality.
  • The reconstructed M terms are reconstructed in both the coder and the decoder. If bit errors enter into the vector of prediction residuals, these M terms can differ between the decoder and the coder, even though they were generated using the same equations.
  • The coder calculates k and the related quantity defined at step 1108 using equations 12 and 13 above.
  • The coder at 806 (Fig. 8) (and 1112, Fig. 11A) calculates the predictions (that the decoder will make at 706, Fig. 7, or 1112, Fig. 11A) based on the k, as distinguished from the method described in U.S. Patent No.
  • The vector of prediction residuals is treated at 1210, 1206 and 1208 according to the method described in U.S. Patent No. 5,226,084, i.e. it is divided into blocks, transformed by a discrete cosine transform, and quantized to result in a vector b_l.
  • This vector is transmitted to the decoder, the received version of which, b_l, is treated according to the method described in U.S. Patent No. 5,226,084 (reconstructed at
  • Another aspect of the invention relates to an enhanced method of minimizing over time the effect of bit errors transmitted in the vector b_l, reconstructed into the prediction residuals.
  • This aspect of the invention relates to the decay factor, referred to as γ above, and is implemented in the method steps branching from decision "error decay?" 1114 in Fig. 11A.
  • The decay factor is used to protect against the perpetuation of errors, with a larger decay factor providing less protection against perpetuation of errors, but allowing for a more purely differential coding scheme, thus possibly taking better advantage of similarities in values from one frame to the next.
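  • The trade-off can be made concrete with a simplified sketch. It tracks a single error injected into one reconstructed log amplitude and assumes that each later frame's prediction scales that error by the decay factor, ignoring interpolation across harmonics. The two decay values are chosen for illustration only and are not taken from the specification.

```python
def error_after_frames(initial_error, decay, frames):
    """Propagate one reconstruction error through successive predictions."""
    error = initial_error
    history = [error]
    for _ in range(frames):
        # Each new prediction scales the previous reconstruction, and
        # therefore the error it carries, by the decay factor.
        error *= decay
        history.append(error)
    return history


# Illustrative decay factors: a smaller one (faster decay) and a larger one.
fast = error_after_frames(1.0, 0.4, 5)
slow = error_after_frames(1.0, 0.7, 5)
```

  • Under these assumptions the residual error after n frames is the initial error multiplied by the decay factor raised to the power n, so a factor of 1 (purely differential coding) never forgets an error, while a small factor forgets it within a few frames.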
  • These competing considerations are accommodated by having a variable decay factor similar to γ, designated ρ, which varies depending on the number of harmonic amplitudes L(0).
  • This decay factor is determined at 1116 (Fig. 11 and shown in detail in Fig. 9) and is used in the calculation of the predictions.
  • The following values are used:
  • The variables x and y are .03 and .05, respectively.
  • Those of ordinary skill in the art will readily understand how to choose these variables, depending on the number of bits available, the type of signal being encoded, desired efficiency and accuracy, etc.
  • The steps that implement the selection of ρ, embodied in equation (20), are illustrated in flow chart form in Fig. 9. (Fig. 9 shows the steps that are taken by the decoder. The same steps are taken by the coder.)
  • The effect of such a selection of ρ is that for a relatively low number of spectral harmonics, i.e. when L(0) is in the low range (determined at decision step 904), the decay factor ρ will be a relatively small number (established at step 906), so that any errors decay away quickly, at the expense of a less purely differential coding method.
  • When L(0) is in the high range, the decay factor is high (established at step 912), which may result in a more persistent error, but which requires fewer bits to encode the differential values.
  • If L(0) is in a middle range (also determined at decision step 908), the degree of protection against persistent errors varies as a function of L(0) (established at 910).
  • The function can be as shown, or other functions that provide the desired results can be used. Typically, the function is a nondecreasing function of L(0).
  • Equation (19) is used, with the variable ρ, rather than the fixed γ.
  • The decay factor is calculated in the coder in the same fashion as it is calculated in the decoder, shown in Fig. 9 and explained above.
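  • A piecewise selection of this kind might be sketched as below. Only the slope x = .03 and offset y = .05 come from the text. The range boundaries `low` and `high`, and the choice to make the constant end values meet the linear segment, are assumptions made for illustration, and the function name is hypothetical.

```python
def decay_factor(num_harmonics, x=0.03, y=0.05, low=15, high=24):
    """Piecewise decay factor that is nondecreasing in the number of harmonics.

    `low` and `high` are assumed range boundaries, not values from the text.
    """
    if num_harmonics <= low:
        # Low range: constant small value, so errors decay quickly.
        return x * low - y
    if num_harmonics <= high:
        # Middle range: grows linearly with L(0).
        return x * num_harmonics - y
    # High range: constant large value, closer to purely differential coding.
    return x * (high + 1) - y


# Tabulate the factor over a hypothetical span of harmonic counts.
table = [decay_factor(n) for n in range(1, 57)]
```

  • With these assumed boundaries the factor is about 0.4 in the low range and about 0.7 in the high range, and it never decreases as L(0) grows, which matches the nondecreasing behavior the text calls typical.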
  • These method steps are conducted in the coder and the decoder for reducing the persistence of errors in the reconstructed spectral amplitudes caused by transmission errors.
  • The method of the invention also addresses another difficulty in accurate prediction in the decoder of the spectral amplitudes.
  • This difficulty stems from the difference in the average amplitude of the spectral amplitudes in one frame, as compared to the preceding frame. For instance, as shown in Fig. 1C, while the curve h establishing the shape of the spectral envelope is relatively similar in the current frame h(0) as compared to the preceding frame h(-1), the average amplitude in the current frame is at least twice that in the prior frame.
  • If equation (19) is applied to generate the estimates M_l^p, each estimate in the vector will be off by a significant amount, which is related to the average amplitude in the frame.
  • The invention overcomes this problem by branching at decision 1120 "zero average?" as shown in Fig. 11B. If this aspect of the invention is not implemented, the method follows from decision 1120 to end 1128, and the predicted spectral log amplitudes are not adjusted for the average amplitude. Following the "yes" decision result from decision 1120, the method of the invention establishes at 1122 the average of the interpolated spectral log amplitudes from each predicted spectral log amplitude M_l^p, computed as above, and then subtracts this average from the vector of predictions at step 1126, as follows:
  • A factor is subtracted from the prediction, which factor is the average of all of the predicted amplitudes, as based on the previous frame. Subtracting this factor from the predicted spectral amplitudes results in a zero-mean predicted value.
  • The result at 1128 is that the average amplitude of any previous frame does not figure into the estimation of any current frame. This happens in both the decoder and the coder. For instance, in the coder, at step 806, the coder generates the prediction residuals T_l according to the following corresponding equation:
  • When the coder generates the prediction residuals, it first subtracts from the actual values the full predicted values, based on the previous values, then adds to each the average value of M_l^p (which can be different from the average value of the entire preceding frame). Consequently, the result of any error in the prediction in the previous frame of the average value is eliminated. This effectively eliminates differential coding of the average value. This has been found to produce little or no decrease in the coding efficiency for this term, while reducing the persistence of bit errors in this term to one frame.
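  • The zero-average step and its mirror in the coder can be sketched as follows. This is a minimal illustration that takes a vector of predicted log amplitudes as given and omits quantization of the residuals; the function names and sample values are hypothetical.

```python
def zero_mean(predictions):
    """Subtract the average of the predicted log amplitudes from each prediction."""
    mean = sum(predictions) / len(predictions)
    return [p - mean for p in predictions]


def encode_residuals(actual, predictions):
    """Coder side: residual = actual value minus zero-mean prediction."""
    return [m - p for m, p in zip(actual, zero_mean(predictions))]


def decode_amplitudes(residuals, predictions):
    """Decoder side: reconstruction = residual plus zero-mean prediction."""
    return [t + p for t, p in zip(residuals, zero_mean(predictions))]


# Hypothetical values: the same spectral shape at two different average levels.
quiet_prediction = [1.0, 2.0, 3.0]
loud_prediction = [11.0, 12.0, 13.0]
actual = [5.0, 6.0, 8.0]

residuals = encode_residuals(actual, quiet_prediction)
rebuilt = decode_amplitudes(residuals, quiet_prediction)
```

  • In this sketch `zero_mean(quiet_prediction)` and `zero_mean(loud_prediction)` are identical, so an error in the average level of the previous frame does not reach the current frame; the average of the current frame is carried entirely by the residuals.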
  • Any function of the number of spectral harmonics having at least one domain interval that is an increasing function of the number of spectral harmonics is within the contemplation of the invention.
  • The means by which the average amplitude of the frame is accounted for can be varied, as long as it is in fact accounted for.
  • The manipulations regarding the decay factor and the average amplitude need not be conducted in the logarithm domain, and can, rather, be performed in the amplitude domain or any other domain which provides the equivalent access.
  • An implementation of the invention is part of the APCO/NASTD/Fed Project 25 vocoder, standardized in 1992.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention uses a time segment of an acoustic speech signal (210) that is represented by a frame of a data signal characterized at least in part by a fundamental frequency and by a plurality of spectral harmonics (240). The current frame is reconstructed by a decoding apparatus, which reconstructs the signal parameters characterizing a frame, using a set of prediction signals. These prediction signals (214) are based on the reconstructed signal parameters that characterize the preceding frame and on the number of spectral harmonics for the current frame and for the preceding frame.
PCT/US1993/011578 1992-11-30 1993-11-29 Method and apparatus for quantization of harmonic amplitudes WO1994012972A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU56824/94A AU5682494A (en) 1992-11-30 1993-11-29 Method and apparatus for quantization of harmonic amplitudes

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US98341892A 1992-11-30 1992-11-30
US07/983,418 1992-11-30

Publications (2)

Publication Number Publication Date
WO1994012972A1 true WO1994012972A1 (fr) 1994-06-09
WO1994012972A9 WO1994012972A9 (fr) 1994-07-21

Family

ID=25529942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1993/011578 WO1994012972A1 (fr) 1992-11-30 1993-11-29 Method and apparatus for quantization of harmonic amplitudes

Country Status (2)

Country Link
AU (1) AU5682494A (fr)
WO (1) WO1994012972A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5774837A (en) * 1995-09-13 1998-06-30 Voxware, Inc. Speech coding system and method using voicing probability determination
FR2760885A1 (fr) * 1997-03-14 1998-09-18 Digital Voice Systems Inc Procede de codage de la parole par quantification de deux sous-trames, codeur et decodeur correspondants
US6070137A (en) * 1998-01-07 2000-05-30 Ericsson Inc. Integrated frequency-domain voice coding using an adaptive spectral enhancement filter
US6161089A (en) * 1997-03-14 2000-12-12 Digital Voice Systems, Inc. Multi-subframe quantization of spectral parameters
CN113362837A (zh) * 2021-07-28 2021-09-07 腾讯音乐娱乐科技(深圳)有限公司 一种音频信号处理方法、设备及存储介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4724535A (en) * 1984-04-17 1988-02-09 Nec Corporation Low bit-rate pattern coding with recursive orthogonal decision of parameters
US5954072A (en) * 1997-01-24 1999-09-21 Tokyo Electron Limited Rotary processing apparatus

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5774837A (en) * 1995-09-13 1998-06-30 Voxware, Inc. Speech coding system and method using voicing probability determination
US5890108A (en) * 1995-09-13 1999-03-30 Voxware, Inc. Low bit-rate speech coding system and method using voicing probability determination
FR2760885A1 (fr) * 1997-03-14 1998-09-18 Digital Voice Systems Inc Procede de codage de la parole par quantification de deux sous-trames, codeur et decodeur correspondants
GB2324689A (en) * 1997-03-14 1998-10-28 Digital Voice Systems Inc Dual subframe quantisation of spectral magnitudes
US6161089A (en) * 1997-03-14 2000-12-12 Digital Voice Systems, Inc. Multi-subframe quantization of spectral parameters
GB2324689B (en) * 1997-03-14 2001-09-19 Digital Voice Systems Inc Dual subframe quantization of spectral magnitudes
US6070137A (en) * 1998-01-07 2000-05-30 Ericsson Inc. Integrated frequency-domain voice coding using an adaptive spectral enhancement filter
CN113362837A (zh) * 2021-07-28 2021-09-07 腾讯音乐娱乐科技(深圳)有限公司 一种音频信号处理方法、设备及存储介质
CN113362837B (zh) * 2021-07-28 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 一种音频信号处理方法、设备及存储介质

Also Published As

Publication number Publication date
AU5682494A (en) 1994-06-22

Similar Documents

Publication Publication Date Title
US5630011A (en) Quantization of harmonic amplitudes representing speech
EP0337636B1 (fr) Dispositif de codage harmonique de la parole
EP0560931B1 (fr) Procedes de quantification de signal vocal et de correction d'erreurs dans ledit signal
CA2254567C (fr) Quantification combinee des parametres de la parole
EP0336658B1 (fr) Quantification vectorielle dans un dispositif de codage harmonique de la parole
KR100531266B1 (ko) 스펙트럼 진폭의 듀얼 서브프레임 양자화
US5701390A (en) Synthesis of MBE-based coded speech using regenerated phase information
US5754974A (en) Spectral magnitude representation for multi-band excitation speech coders
US6122608A (en) Method for switched-predictive quantization
US5247579A (en) Methods for speech transmission
JP3996213B2 (ja) 入力標本列処理方法
EP1103955A2 (fr) Codeur de parole hybride harmonique-transformation
US5490230A (en) Digital speech coder having optimized signal energy parameters
MXPA01003150A (es) Procedimiento de cuantificacion de los parametros de un codificador de palabras.
WO1997005602A1 (fr) Procede et equipement de generation et de codage de racines carrees de spectres de raies
EP1385150B1 (fr) Procédé et dispositif pour la caractérisation des signaux audio transitoires
WO1994012972A1 (fr) Procede et appareil pour la quantification des amplitudes d'harmoniques
WO1994012972A9 (fr) Procede et appareil pour la quantification des amplitudes d'harmoniques
Moriya et al. An 8 kbit/s transform coder for noisy channels
EP0573215A2 (fr) Synchronisation de vocodeurs
KR100220783B1 (ko) 음성 양자화 및 에러 보정 방법

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AT AU BB BG BR BY CA CH CZ DE DK ES FI GB HU JP KP KR KZ LK LU MG MN MW NL NO NZ PL PT RO RU SD SE SK UA US VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/12-12/12,DRAWINGS,REPLACED BY NEW PAGES 1/9-9/9;DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA