EP1057292B1

EP1057292B1 - A fast frequency transformation technique for transform audio coders

Info

Publication number: EP1057292B1
Application number: EP98909964A
Authority: EP
Inventors: Mohammed Javed Absar; Sapna George; Antonio Mario Alvarez-Tinoco
Original assignee: STMicroelectronics Asia Pacific Pte Ltd
Current assignee: STMicroelectronics Asia Pacific Pte Ltd
Priority date: 1998-02-21
Filing date: 1998-02-21
Publication date: 2004-04-28
Anticipated expiration: 2018-02-21
Also published as: DE69823557D1; WO1999043110A1; EP1057292A1; DE69823557T2

Description

Technical Field

This invention is applicable in the field of multi-channel audio coders which use the modified discrete cosine transform as a step in the compression of audio signals.

Background Art

In order to more efficiently broadcast or record audio signals, the amount of information required to represent the audio signals may be reduced. In the case of digital audio signals, the amount of digital information needed to accurately reproduce the original pulse code modulation (PCM) samples may be reduced by applying a digital compression algorithm, resulting in a digitally compressed representation of the original signal. The goal of the digital compression algorithm is to produce a digital representation of an audio signal which, when decoded and reproduced, sounds the same as the original signal, while using a minimum of digital information for the compressed or encoded representation.
Recent advances in audio coding technology have led to high compression ratios while keeping audible degradation in the compressed signal to a minimum. These coders are intended for a variety of applications, including 5.1 channel film soundtracks, HDTV, laser discs and multimedia. Description of one applicable method can be found in the Advanced Television Systems Committee (ATSC) Standard document entitled "Digital Audio Compression (AC-3) Standard", Document A/52, 20 December, 1995.
In the basic approach, at the encoder the time domain audio signal is first converted to the frequency domain using a bank of filters. The frequency domain coefficients, thus generated, are converted to floating point representation. In floating point syntax, each coefficient is represented as a mantissa and an exponent. The bulk of the compressed bitstream transmitted to the decoder comprises these exponents and mantissas.
The exponents are usually transmitted in their original form. However, each mantissa must be truncated to a fixed or variable number of decimal places. The number of bits to be used for coding each mantissa is obtained from a bit allocation algorithm which may be based on the masking property of the human auditory system. Lower numbers of bits result in higher compression ratios because less space is required to transmit the coefficients. However, this may cause high quantization errors, leading to audible distortion. A good distribution of available bits to each mantissa forms the core of the advanced audio coders.
The frequency transformation phase has one of the greatest computation requirements in a transform coder. Therefore, an efficient implementation of this phase can decrease the computation requirement of the system significantly and make real time operation of the encoder more easily attainable.
In some encoders such as those specified in the AC-3 standard, the frequency domain transformation of signals is performed by the modified discrete cosine transform (MDCT). If directly implemented, the MDCT requires O(N²) additions and multiplications.
However, it has been found possible to reduce the number of required operations significantly if the MDCT equation can be computed in a form that is amenable to the use of the well known Fast Fourier Transform (FFT) method of J.W. Cooley and J.W. Tukey (1960). A known application of an FFT to an MDCT is disclosed in, for example, EP-A-0564089.
The present invention seeks to provide an alternative computation method using a Fast Fourier Transform. Moreover, the invention seeks to use a single FFT for two channels to achieve greater reduction in computational requirements of the system.

Summary of the Invention

In accordance with the present invention there is provided a method for coding audio data according to claims 1 and 8.
The present invention further provides a method for coding audio data as defined in claim 14, including the steps of:

obtaining first and second input sequences of digital audio samples x[n], y[n] corresponding to respective first and second audio channels;
combining the first and second input sequences of digital audio samples into a single complex input sample sequence z[n], where z[n] = x[n] + jy[n];
pre-processing the complex input sequence samples including applying a pre-multiplication factor cos(πn/N) + jsin(πn/N) to obtain modified complex input sequence samples, where N is the number of audio samples in each of the first and second input sequences and n = 0,....,(N-1);
transforming the modified complex input sequence samples into a complex transform coefficient sequence Z _k utilising a fast Fourier transform, wherein k = 0,....,(N/2-1); and
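post-processing the sequence of complex transform coefficients to obtain first and second sequences of audio coded frequency domain coefficients X_k, Y_k corresponding to the first and second audio channels according to:
$G_{k} = (Z_{k} + Z_{N-k-1}^{*})/2, \quad k = 0, \ldots, N/2-1$
$G'_{k} = (Z_{k} - Z_{N-k-1}^{*})/(2j), \quad k = 0, \ldots, N/2-1$
$X_{k} = \cos\gamma \, \left(g_{k,r}\cos(\pi(k+1/2)/N) - g_{k,i}\sin(\pi(k+1/2)/N)\right) - \sin\gamma \, \left(g_{k,r}\sin(\pi(k+1/2)/N) + g_{k,i}\cos(\pi(k+1/2)/N)\right)$
$Y_{k} = \cos\gamma \, \left(g'_{k,r}\cos(\pi(k+1/2)/N) - g'_{k,i}\sin(\pi(k+1/2)/N)\right) - \sin\gamma \, \left(g'_{k,r}\sin(\pi(k+1/2)/N) + g'_{k,i}\cos(\pi(k+1/2)/N)\right)$
where $G_k$ is a transform coefficient sequence for the first channel;
$G'_k$ is a transform coefficient sequence for the second channel;
$g_{k,r}$ and $g_{k,i}$ are the real and imaginary transform coefficient components of $G_k$;
$g'_{k,r}$ and $g'_{k,i}$ are the real and imaginary transform coefficient components of $G'_k$;
$Z_{N-k-1}^{*}$ is the complex conjugate of $Z_{N-k-1}$; and
$\gamma(k) = \pi(2k+1)/4$.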

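The modified discrete cosine transform equation can be expressed as
$X_{k} = \sum_{n=0}^{N-1} x[n] \cos\left(2\pi(2n+1)(2k+1)/4N + \pi(2k+1)/4\right), \quad k = 0, \ldots, N/2-1$
where x[n] is the input sequence for a channel and N is the transform length.
Instead of evaluating $X_k$ in the form given above it could be computed as
$X_{k} = \cos\gamma \, \left(g_{k,r}\cos(\pi(k+1/2)/N) - g_{k,i}\sin(\pi(k+1/2)/N)\right) - \sin\gamma \, \left(g_{k,r}\sin(\pi(k+1/2)/N) + g_{k,i}\cos(\pi(k+1/2)/N)\right)$
where $\gamma = \pi(2k+1)/4$ and $G_k = g_{k,r} + jg_{k,i}$. The symbol j represents the imaginary number $\sqrt{-1}$. The expression is obtained from the well known FFT method, by first using the transformation $x'[n] = x[n] \, e^{j\pi n/N}$ and then computing the FFT
$G_{k} = \sum_{n=0}^{N-1} x'[n] \, e^{j2\pi nk/N}$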
For a two channel approach, a complex variable $z[n] = x[n] \, e^{j\pi n/N} + jy[n] \, e^{j\pi n/N}$ is defined, where x[n] and y[n] are the sample sequences for the two channels and $e^{j\pi n/N}$ represents the pre-multiplication factor. Using the FFT approach, the frequency coefficient $Z_k$ for the variable z[n] is computed. From $Z_k$ the values $G_k = (Z_k + Z_{N-k-1}^{*})/2$ and $G'_k = (Z_k - Z_{N-k-1}^{*})/(2j)$, required to compute the final MDCT for each channel, respectively, are calculated.
If either or both of the channels require short length transforms, two short transforms are taken using the above approach. If neither needs a short transform, a single long transform is used. As an additional step in reducing computation, the windowing function can be combined with the pre-processing stage.

Brief Description of the Drawings

The invention is described in detail hereinafter, by way of example only, with reference to preferred embodiments thereof and with aid of the accompanying drawings, wherein:

Figure 1 is a diagrammatic representation of a stream of audio data and the substructure arrangement thereof;
Figure 2 is a functional block diagram of a digital audio encoder;
Figure 3 is a functional block diagram of a system for encoding a single audio channel; and
Figure 4 is a functional block diagram of a system for encoding a pair of audio channels.

Detailed Description of the Preferred Embodiments

The above mentioned Advanced Television Systems Committee (ATSC) Standard document entitled "Digital Audio Compression (AC-3) Standard" (Document A/52, 20 December, 1995) describes methods for encoding and decoding audio signals.
In general, the input to an audio coder comprises a stream of digitised samples of the time domain analog signal. For a multi-channel encoder the stream consists of interleaved samples for each channel. The input stream is sectioned into blocks, each block containing N consecutive samples of each channel (see Fig. 1). Thus within a block the N samples of a channel form a sequence {x[0], x[1], x[2], ..., x[N-1]}.
The time domain samples are next converted to the frequency domain using an analysis filter bank (see Fig. 2). The frequency domain coefficients, thus generated, form a coefficient set which can be identified as $(X_0, X_1, X_2, \ldots, X_{N/2-1})$. Since the signal is real, only the first N/2 frequency components are considered. Here $X_0$ is the lowest frequency (DC) component while $X_{N/2-1}$ is the highest frequency component of the signal.
Audio compression essentially entails finding how much of the information in the set $(X_0, X_1, X_2, \ldots, X_{N/2-1})$ is necessary to reproduce the original analog signal at the decoder with minimal audible distortion.
The coefficient set is normally converted into floating point format, where each coefficient is represented by an exponent and mantissa. The exponent set is usually transmitted in its original form. However, the mantissa is truncated to a fixed or variable number of decimal places. The number of bits for coding a mantissa is usually obtained from a bit allocation algorithm which for advanced psychoacoustic coders may be based on the masking property of the human auditory system. A low number of bits results in a high compression ratio because less space is required to transmit the coefficients.
However this causes very high quantization error leading to audible distortion. A good distribution of available bits to each mantissa forms the core of the most advanced encoders.
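The trade-off between mantissa length and quantization error can be illustrated with a small sketch. This is not the AC-3 exponent and mantissa coding; it only assumes a generic split of a coefficient into a power-of-two exponent and a fractional mantissa, and the helper names `quantize` and `dequantize` are hypothetical.

```python
import math

def quantize(coeff, mantissa_bits):
    """Split coeff into (exponent, truncated mantissa) -- illustrative only."""
    m, e = math.frexp(coeff)            # coeff == m * 2**e with 0.5 <= |m| < 1
    q = math.trunc(m * (1 << mantissa_bits))   # keep only mantissa_bits bits
    return e, q

def dequantize(e, q, mantissa_bits):
    """Rebuild the coefficient from the exponent and truncated mantissa."""
    return math.ldexp(q / (1 << mantissa_bits), e)

coeff = 0.3
for bits in (4, 8, 12):
    e, q = quantize(coeff, bits)
    approx = dequantize(e, q, bits)
    print(bits, approx, abs(coeff - approx))   # error shrinks as bits grow
```

Fewer mantissa bits give a shorter code word but a larger reconstruction error, which is the balance the bit allocation algorithm has to strike.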
In some encoders such as AC-3, the frequency domain transformation of signals is performed by the modified discrete cosine transform (MDCT) (Eq. 1):
$X_{k} = \sum_{n=0}^{N-1} x[n] \cos\left(2\pi(2n+1)(2k+1)/4N + \pi(2k+1)/4\right), \quad k = 0, \ldots, N/2-1 \qquad \text{(Eq. 1)}$
If directly implemented in the form given above, the MDCT requires O(N²) additions and multiplications.
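A direct evaluation makes the O(N²) cost visible. The sketch below assumes the unscaled form of the transform, X_k = Σ x[n] cos(α(n,k) + γ(k)), with α(n,k) = 2π(2n+1)(2k+1)/4N and γ(k) = π(2k+1)/4 as defined in this description. The function name `mdct_direct` is illustrative.

```python
import math

def mdct_direct(x):
    """Direct evaluation of the MDCT sum: N/2 outputs, N terms each."""
    N = len(x)
    coeffs = []
    for k in range(N // 2):
        gamma = math.pi * (2 * k + 1) / 4
        acc = 0.0
        for n in range(N):               # inner loop gives the O(N^2) cost
            alpha = 2 * math.pi * (2 * n + 1) * (2 * k + 1) / (4 * N)
            acc += x[n] * math.cos(alpha + gamma)
        coeffs.append(acc)
    return coeffs

print(mdct_direct([1.0, 0.0, 0.0, 0.0]))
```

For a block of N samples the nested loops perform N²/2 multiply-accumulate steps, which is the cost the FFT formulation is meant to avoid.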

Single Channel FFT

It is possible to reduce the number of required operations significantly if one is able to evaluate Eq. 1 using the well known Fast Fourier Transform method of J.W. Cooley and J.W. Tukey (1960). The general Discrete Fourier Transform (DFT) is given below (Eq. 2).
$F_{k} = \sum_{n=0}^{N-1} x[n] \, e^{j2\pi nk/N}, \quad k = 0, \ldots, N-1 \qquad \text{(Eq. 2)}$
It requires O(N²) complex additions and multiplications. By using the Fast Fourier Transform method the DFT in Eq. 2 can be computed with only O(N log₂ N) operations. Here j is the symbol for the imaginary number, i.e. $j = \sqrt{-1}$.
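Although it may not be immediately apparent how Eq. 1 can be transformed to Eq. 2, a careful analysis shows that this is indeed possible. To simplify Eq. 1, two functions can be defined
$\alpha(n,k) = 2\pi(2n+1)(2k+1)/4N$
$\gamma(k) = \pi(2k+1)/4$
Then, using these functions, Eq. 1 can be rewritten as
$X_{k} = \sum_{n=0}^{N-1} x[n] \cos\left(\alpha(n,k) + \gamma(k)\right)$
$X_{k} = \sum_{n=0}^{N-1} x[n] \left(\cos\alpha(n,k)\cos\gamma(k) - \sin\alpha(n,k)\sin\gamma(k)\right) \qquad \text{(Eq. 6)}$
In Eq. 6 the trigonometric equality cos(a+b) = cos a cos b - sin a sin b is used for simplification. Furthermore, since the function γ(k) is not dependent on the variable n, it can be brought outside the summation expression to give
$X_{k} = \cos\gamma(k) \, T_{1} - \sin\gamma(k) \, T_{2} \qquad \text{(Eq. 7)}$
where
$T_{1} = \sum_{n=0}^{N-1} x[n] \cos\alpha(n,k), \qquad T_{2} = \sum_{n=0}^{N-1} x[n] \sin\alpha(n,k)$
The two terms, T₁ and T₂, can now be evaluated separately. Using Euler's identity $e^{j\theta} = \cos\theta + j\sin\theta$, we can express:
$\cos\alpha(n,k) = (e^{j\alpha(n,k)} + e^{-j\alpha(n,k)})/2$
and
$\sin\alpha(n,k) = (e^{j\alpha(n,k)} - e^{-j\alpha(n,k)})/(2j)$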
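Therefore we can rewrite the term T₁ as
$T_{1} = \sum_{n=0}^{N-1} x[n] \, (e^{j\alpha(n,k)} + e^{-j\alpha(n,k)})/2 = A_{1} + A_{2}$
where
$A_{1} = \frac{1}{2} \sum_{n=0}^{N-1} x[n] \, e^{j\alpha(n,k)}$
Similarly
$A_{2} = \frac{1}{2} \sum_{n=0}^{N-1} x[n] \, e^{-j\alpha(n,k)}$
The term A₁ can thus be evaluated from Eq. 8 and Eq. 9. Since $\alpha(n,k) = 2\pi nk/N + \pi n/N + \pi(k+1/2)/N$,
$A_{1} = \frac{1}{2} \, e^{j\pi(k+1/2)/N} \sum_{n=0}^{N-1} x[n] \, e^{j\pi n/N} \, e^{j2\pi nk/N}$
If a complex variable is defined as:
$x'[n] = x[n] \, e^{j\pi n/N} \qquad \text{(Eq. 11)}$
then Eq. 10 is simply:
$A_{1} = \frac{1}{2} \, e^{j\pi(k+1/2)/N} \, G_{k}$
where
$G_{k} = \sum_{n=0}^{N-1} x'[n] \, e^{j2\pi nk/N} \qquad \text{(Eq. 12)}$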
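The complex term $G_k = g_{k,r} + jg_{k,i}$, where $g_{k,r}, g_{k,i} \in \mathbb{R}$ (the set of real numbers), in Eq. 12 is essentially the same as $F_k$ in Eq. 2. Therefore the FFT approach can be used to evaluate $G_k$. This brings down computation from O(N²) to O(N log N). Similarly, the second term A₂ in Eq. 8 and Eq. 9 can be evaluated
$A_{2} = \frac{1}{2} \, e^{-j\pi(k+1/2)/N} \, G_{k}^{*}$
where
$G_{k}^{*} = \sum_{n=0}^{N-1} x[n] \, e^{-j\pi n/N} \, e^{-j2\pi nk/N}$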
Note that $G_k^{*}$ is actually the complex conjugate of $G_k$ which was obtained by Eq. 12. That is, if $G_k = g_{k,r} + jg_{k,i}$, where $g_{k,r}$ and $g_{k,i}$ are real as defined earlier, then $G_k^{*} = g_{k,r} - jg_{k,i}$. Therefore $G_k^{*}$ in Eq. 13 does not need to be computed again, and the result from Eq. 12 can be re-used. That is, only one FFT needs to be computed for the evaluation of T₁. The result of Eq. 8 to Eq. 13 is thus
$T_{1} = \frac{1}{2}\left(e^{j\pi(k+1/2)/N} \, G_{k} + e^{-j\pi(k+1/2)/N} \, G_{k}^{*}\right) \qquad \text{(Eq. 14)}$
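Next, the term T₂ can be analysed in the same way, re-using the same $G_k$:
$T_{2} = \frac{1}{2j}\left(e^{j\pi(k+1/2)/N} \, G_{k} - e^{-j\pi(k+1/2)/N} \, G_{k}^{*}\right) \qquad \text{(Eq. 15)}$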
Finally, after simplifications of Eq. 7, 14 and 15
$X_{k} = \cos\gamma(k) \, T_{1} - \sin\gamma(k) \, T_{2}$
$X_{k} = \cos\gamma(k) \, \frac{1}{2}\left(e^{j\pi(k+1/2)/N} G_{k} + e^{-j\pi(k+1/2)/N} G_{k}^{*}\right) - \sin\gamma(k) \, \frac{1}{2j}\left(e^{j\pi(k+1/2)/N} G_{k} - e^{-j\pi(k+1/2)/N} G_{k}^{*}\right)$
$X_{k} = \cos\gamma \, \left(g_{k,r}\cos(\pi(k+1/2)/N) - g_{k,i}\sin(\pi(k+1/2)/N)\right) - \sin\gamma \, \left(g_{k,r}\sin(\pi(k+1/2)/N) + g_{k,i}\cos(\pi(k+1/2)/N)\right) \qquad \text{(Eq. 16)}$
The term $G_k = g_{k,r} + jg_{k,i}$ is computed in O(N log N) operations by use of FFT algorithms. The additional operation outlined in Eq. 16 to extract the final $X_k$ is only of order O(N). Therefore the MDCT can now be computed in O(N log₂ N) time. The operations required to obtain the MDCT are illustrated in Fig. 3.
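The single-channel procedure (pre-multiplication, one FFT, O(N) post-processing) can be sketched and checked numerically against the direct sum. This is a minimal illustration assuming NumPy. Because `numpy.fft.ifft` uses the positive exponent with a 1/N scale, it is rescaled by N here to match the sign convention of G_k. The function names are hypothetical.

```python
import numpy as np

def mdct_direct(x):
    """Reference O(N^2) evaluation of X_k = sum x[n] cos(alpha + gamma)."""
    N = len(x)
    n = np.arange(N)
    k = np.arange(N // 2)[:, None]
    alpha = 2 * np.pi * (2 * n + 1) * (2 * k + 1) / (4 * N)
    gamma = np.pi * (2 * k + 1) / 4
    return (x * np.cos(alpha + gamma)).sum(axis=1)

def mdct_fft(x):
    """MDCT of one channel via pre-multiplication, one FFT and post-processing."""
    N = len(x)
    n = np.arange(N)
    k = np.arange(N // 2)
    xp = x * np.exp(1j * np.pi * n / N)        # x'[n] = x[n] e^{j pi n / N}
    G = N * np.fft.ifft(xp)[: N // 2]          # G_k = sum x'[n] e^{+j 2 pi n k / N}
    theta = np.pi * (k + 0.5) / N
    gamma = np.pi * (2 * k + 1) / 4
    t1 = G.real * np.cos(theta) - G.imag * np.sin(theta)
    t2 = G.real * np.sin(theta) + G.imag * np.cos(theta)
    return np.cos(gamma) * t1 - np.sin(gamma) * t2

x = np.random.default_rng(0).standard_normal(512)
print(np.allclose(mdct_fft(x), mdct_direct(x)))
```

Only the first N/2 outputs of the FFT are used, and the post-processing touches each coefficient once.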

Combining Two Channels into Single FFT

Suppose the multi-channel encoder is required to process m audio channels. Instead of computing an FFT for each channel as described in the previous section, it is possible to further reduce the computational requirement of the coder by combining two channels and using a single FFT only. In effect, instead of m FFTs only m/2 FFTs need to be computed.
If the input sequences are real, then it is known that the DFTs of any two channels can be computed with only one FFT block by treating the input as a complex sequence.
The real part is formed from the sequence for any one channel and the imaginary part is from data of another channel. After the Fourier Transform is computed for the resulting complex variable, the resulting transform for each channel can be easily retrieved.
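The well-known trick for purely real inputs can be sketched as follows. This assumes NumPy and its forward-FFT sign convention, and the function name `two_real_ffts` is illustrative. The separation relies on the conjugate symmetry X_{N-k} = X_k* of a real sequence's DFT.

```python
import numpy as np

def two_real_ffts(x, y):
    """DFTs of two real sequences from one complex FFT."""
    N = len(x)
    Z = np.fft.fft(x + 1j * y)                 # one FFT of z[n] = x[n] + j y[n]
    Zr = np.conj(Z[(-np.arange(N)) % N])       # conj(Z_{N-k}), with Z_N taken as Z_0
    X = (Z + Zr) / 2
    Y = (Z - Zr) / 2j
    return X, Y

rng = np.random.default_rng(0)
x, y = rng.standard_normal(64), rng.standard_normal(64)
X, Y = two_real_ffts(x, y)
print(np.allclose(X, np.fft.fft(x)), np.allclose(Y, np.fft.fft(y)))
```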
However, in the present case the input data to the FFT block is actually a complex number (formed by multiplying the real data by the complex factor $e^{j\pi n/N}$). In this case, there is no straightforward way of retrieving the frequency transform after having combined two channels. However, using some processing after the FFT one can still compute the DFT of two channels using a single FFT block.
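Let {x[0], x[1], x[2], ..., x[N-1]} be N input samples of the first channel and {y[0], y[1], y[2], ..., y[N-1]} be the samples for the second channel. As described above, the frequency coefficients (Eq. 12 and 13) must be obtained for the first channel
$G_{k} = \sum_{n=0}^{N-1} x[n] \, e^{j\pi n/N} \, e^{j2\pi nk/N}$
and similarly, for the second channel
$G'_{k} = \sum_{n=0}^{N-1} y[n] \, e^{j\pi n/N} \, e^{j2\pi nk/N}$
Defining the complex variable
$z[n] = x[n] \, e^{j\pi n/N} + jy[n] \, e^{j\pi n/N}$
and computing its DFT using the FFT method yields Eq. 17
$Z_{k} = \sum_{n=0}^{N-1} z[n] \, e^{j2\pi nk/N} = \sum_{n=0}^{N-1} (x[n] + jy[n]) \, e^{j2\pi n(k+1/2)/N} \qquad \text{(Eq. 17)}$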
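Now substituting N-k for k in the above expression,
$Z_{N-k} = \sum_{n=0}^{N-1} (x[n] + jy[n]) \, e^{j\pi n/N} \, e^{j2\pi n(N-k)/N} = \sum_{n=0}^{N-1} (x[n] + jy[n]) \, e^{j2\pi n} \, e^{-j2\pi n(k-1/2)/N}$
Since $e^{j2\pi n} = 1$ for n ∈ I (the set of integers), the term $e^{j2\pi n}$ vanishes in the above expression:
$Z_{N-k} = \sum_{n=0}^{N-1} (x[n] + jy[n]) \, e^{-j2\pi n(k-1/2)/N}$
Taking the complex conjugate of $Z_{N-k}$:
$Z_{N-k}^{*} = \sum_{n=0}^{N-1} (x[n] - jy[n]) \, e^{j2\pi n(k-1/2)/N}$
Using Eq. 18 and 20, separate expressions for $G_k$ and $G'_k$ are required. In a simple case the two expressions would add and subtract to give the required terms. However, in this instance that is not the case, because the exponent contains k-1/2 rather than the k+1/2 of Eq. 17. But, substituting N-k-1 for N-k in Eq. 18, the following is obtained
$Z_{N-k-1}^{*} = \sum_{n=0}^{N-1} (x[n] - jy[n]) \, e^{j2\pi n(k+1/2)/N}$
Now the term $e^{j2\pi n(k+1/2)/N}$ is common in both Eq. 17 and 19, and it is possible to isolate the contribution of each channel:
$Z_{k} + Z_{N-k-1}^{*} = 2\sum_{n=0}^{N-1} x[n] \, e^{j2\pi n(k+1/2)/N} = 2G_{k}$
Similarly,
$Z_{k} - Z_{N-k-1}^{*} = 2j\sum_{n=0}^{N-1} y[n] \, e^{j2\pi n(k+1/2)/N} = 2jG'_{k}$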
That is
$G_{k} = (Z_{k} + Z_{N-k-1}^{*})/2, \quad k = 0, \ldots, N/2-1 \qquad \text{(Eq. 22)}$
and
$G'_{k} = (Z_{k} - Z_{N-k-1}^{*})/(2j), \quad k = 0, \ldots, N/2-1 \qquad \text{(Eq. 23)}$
Substituting the expressions from Eq. 22 and 23 into Eq. 16, the MDCT for each channel is obtained. The overall process is illustrated in Fig. 4.
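The two-channel procedure can be checked numerically with a short sketch. It assumes NumPy, with `numpy.fft.ifft` rescaled by N to obtain the positive-exponent sum used for Z_k. The helper names `mdct_direct` and `mdct_pair_fft` are hypothetical.

```python
import numpy as np

def mdct_direct(x):
    """Reference O(N^2) MDCT used only to check the result."""
    N = len(x)
    n = np.arange(N)
    k = np.arange(N // 2)[:, None]
    alpha = 2 * np.pi * (2 * n + 1) * (2 * k + 1) / (4 * N)
    return (x * np.cos(alpha + np.pi * (2 * k + 1) / 4)).sum(axis=1)

def mdct_pair_fft(x, y):
    """MDCTs of two equal-length channels from a single complex FFT."""
    N = len(x)
    n = np.arange(N)
    k = np.arange(N // 2)
    z = (x + 1j * y) * np.exp(1j * np.pi * n / N)   # combine, then pre-multiply
    Z = N * np.fft.ifft(z)                          # Z_k with e^{+j 2 pi n k / N}
    Zc = np.conj(Z[N - 1 - k])                      # Z*_{N-k-1}
    G, Gp = (Z[k] + Zc) / 2, (Z[k] - Zc) / 2j       # per-channel coefficients
    theta = np.pi * (k + 0.5) / N
    gamma = np.pi * (2 * k + 1) / 4

    def post(C):
        t1 = C.real * np.cos(theta) - C.imag * np.sin(theta)
        t2 = C.real * np.sin(theta) + C.imag * np.cos(theta)
        return np.cos(gamma) * t1 - np.sin(gamma) * t2

    return post(G), post(Gp)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(256), rng.standard_normal(256)
X, Y = mdct_pair_fft(x, y)
print(np.allclose(X, mdct_direct(x)), np.allclose(Y, mdct_direct(y)))
```

Both channels share one length-N FFT. The separation and post-multiplication steps add only O(N) work per channel.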

Transform Length Adjustment Technique

The frequency transform length N is decided by the encoder based on temporal and spectral resolution requirements. The input signal is usually analysed with a high frequency bandpass filter to detect the presence of transients. This information is used to adjust the block length, restricting quantization noise associated with the transient within a small temporal region about the transient, avoiding temporal masking. Thus, if a transient is detected in a channel, two short transforms of length N/2 each are taken. In the absence of a transient, a single long transform of length N is used, thus providing higher spectral resolution.
From the method described in the previous section for computing the MDCT for two channels using a single FFT block, it is evident that the transform length for the two paired channels must be the same. Therefore, pairing for the transformation phase must be such that channels with identical transform length are grouped together.
It is however possible that not all channels can be paired with such convenience. Assume that the total number of channels is even (if not, take a single FFT for one channel so that the rest form an even group). Suppose that out of the m channels, l need a long transform and therefore m-l require short transforms.
If l is an even number, then since the total is even, it follows that m-l is also even. In this case, from the l channels that need a long transform, l/2 pairs are formed and for each of the l/2 pairs a single FFT is computed to obtain the MDCT for the original paired channels. Similarly, the m-l channels are paired to form (m-l)/2 pairs and for each of the (m-l)/2 pairs two short FFTs are computed.
Now consider the case when l = 2r + 1 is an odd number. Then m - l = 2s + 1 is also an odd number. The 2r channels requiring a long transform are paired together to form r pairs and then 2r transforms are computed using r FFTs only. Similarly, for the 2s channels s pairs are formed. What remains is one channel requiring a long transform and another requiring two short transforms. These two channels are paired together and two short FFTs are computed to derive the MDCT.
The rationale for constraining the long transform to two short ones is as follows. A short transform is required for restricting quantization noise associated with the transient within a small temporal region about the transient, avoiding temporal masking. A long transform gives slightly better frequency resolution, but the error from using short transforms in its place is small compared to the case when a long transform is utilised in the presence of a transient. Forcing a long transform onto a channel in the presence of a transient leads to greater distortion in the final produced music. This conjecture was proven true by experimental studies on benchmark music streams.
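The pairing rule can be sketched as a small planning function. The name `plan_transforms`, the tuple layout, and the choice of which channel is left unpaired when the channel count is odd are assumptions made for illustration; the text does not fix them.

```python
def plan_transforms(needs_short):
    """Pair channels by transform length.

    needs_short[c] is True if channel c needs short transforms.
    Returns a list of (channel_a, channel_b, kind); channel_b is None
    for a channel that is transformed on its own.
    """
    long_ch = [c for c, s in enumerate(needs_short) if not s]
    short_ch = [c for c, s in enumerate(needs_short) if s]
    plan = []

    def pair_up(channels, kind):
        # Pair consecutive channels; return the leftover one, if any.
        for i in range(0, len(channels) - 1, 2):
            plan.append((channels[i], channels[i + 1], kind))
        return channels[-1] if len(channels) % 2 else None

    left_long = pair_up(long_ch, "long")
    left_short = pair_up(short_ch, "short")
    if left_long is not None and left_short is not None:
        # Odd/odd case: the leftover long channel is forced to short transforms.
        plan.append((left_long, left_short, "short"))
    elif left_long is not None:
        plan.append((left_long, None, "long"))
    elif left_short is not None:
        plan.append((left_short, None, "short"))
    return plan

print(plan_transforms([False, True, False, False, True, True]))
# [(0, 2, 'long'), (1, 4, 'short'), (3, 5, 'short')]
```

With three long and three short channels (l = 2r + 1, r = 1), one long pair and one short pair are formed, and the remaining long channel is paired with the remaining short channel using short transforms.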

Combining Windowing with pre-processing

Before the time domain signal x[n] is transformed to the frequency domain, a windowing function is usually applied. Thus, if the sampled signal is p[n] then the sequence that is applied to the frequency transformation block is x[n] = p[n] w[n], where w[n] is the windowing function. From the previous sections we noted that before the FFT is computed for a block, a pre-processing step is performed as given in Eq. 11. Thus
$x'[n] = x[n] \, e^{j\pi n/N} = (p[n] \, w[n]) \, e^{j\pi n/N} = (p[n] \, w[n]) \, (\cos(\pi n/N) + j\sin(\pi n/N)) = p[n] \, \left(w[n]\cos(\pi n/N) + j \, w[n]\sin(\pi n/N)\right) \qquad \text{(Eq. 24)}$
From Eq. 24 we note that the windowing function can be combined with the cosine and sine multiplication required in Eq. 11. This brings down the computation even further, since the sine and cosine are usually implemented in a real time system as table lookups. If two tables are constructed as defined below
$\mathrm{rcos}[n] = w[n] \cos(\pi n/N)$
$\mathrm{rsin}[n] = w[n] \sin(\pi n/N)$
then Eq. 11 can be rewritten as
$x'[n] = (p[n] \, \mathrm{rcos}[n]) + j \, (p[n] \, \mathrm{rsin}[n])$
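The table construction can be sketched as below, assuming NumPy. The sine window used here is only a placeholder assumption (the description does not specify w[n]), and the function names are illustrative.

```python
import numpy as np

def build_tables(w):
    """Precompute rcos[n] = w[n] cos(pi n / N) and rsin[n] = w[n] sin(pi n / N)."""
    N = len(w)
    n = np.arange(N)
    return w * np.cos(np.pi * n / N), w * np.sin(np.pi * n / N)

def preprocess(p, rcos, rsin):
    """Windowing and pre-multiplication in one pass: two real multiplies per sample."""
    return p * rcos + 1j * (p * rsin)

N = 512
w = np.sin(np.pi * (np.arange(N) + 0.5) / N)     # placeholder window (assumption)
rcos, rsin = build_tables(w)                     # built once, reused for every block

p = np.random.default_rng(0).standard_normal(N)
separate = (p * w) * np.exp(1j * np.pi * np.arange(N) / N)   # window, then rotate
print(np.allclose(preprocess(p, rcos, rsin), separate))
```

The tables depend only on N and the window, so they can be computed once per transform length and shared across blocks and channels.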
Although the invention has been described herein primarily in terms of its mathematical derivation and application, and the procedures required for implementation, it will be readily recognised by those skilled in the art that the procedures described can be implemented by means of any desired computational apparatus. For example, the invention may be embodied in computer software operating on general purpose computing equipment, or may be embodied in purpose built circuitry or contained in microcode or the like in an integrated circuit or set of integrated circuits.
The foregoing detailed description of embodiments of the invention has been presented by way of example only, and is not intended to be considered limiting to the invention as defined in the claims appended hereto.

Glossary of Equations:

MDCT

$T_{1} = \frac{1}{2}\left(e^{j\pi(k+1/2)/N} \, G_{k} + e^{-j\pi(k+1/2)/N} \, G_{k}^{*}\right)$
$T_{2} = \frac{1}{2j}\left(e^{j\pi(k+1/2)/N} \, G_{k} - e^{-j\pi(k+1/2)/N} \, G_{k}^{*}\right)$
$G_{k} = (Z_{k} + Z_{N-k-1}^{*})/2, \quad k = 0, \ldots, N/2-1$
$G'_{k} = (Z_{k} - Z_{N-k-1}^{*})/(2j), \quad k = 0, \ldots, N/2-1$
$\alpha(n,k) = 2\pi(2n+1)(2k+1)/4N$
$\gamma(k) = \pi(2k+1)/4$

Claims

A method for coding audio data including the steps of:
obtaining at least one input sequence of digital audio samples;

pre-processing the input sequence samples including applying a pre-multiplication factor to obtain modified input sequence samples;

transforming the modified input sequence samples into a transform coefficient sequence utilising a fast Fourier transform; and

post-processing the sequence of transform coefficients including applying first post-multiplication factors to the real and imaginary coefficient components, differencing and combining the post-multiplied real and imaginary components, applying second post-multiplication factors to the difference and combination results, and differencing to obtain a sequence of modified discrete cosine transform coefficients representing said input sequence of digital audio samples.
A method as claimed in claim 1, wherein the pre-multiplication factor, and first and second post-multiplication factors are trigonometric function factors.
A method as claimed in claim 2, wherein the pre-multiplication factor applied to each digital audio sample in the input sequence is a trigonometric function of the audio sample sequence position and the number of samples in the sequence.
A method as claimed in claim 2, wherein the first post-multiplication factors for each transform coefficient in the sequence are trigonometric functions of the transform coefficient sequence position and the number of coefficients in the sequence.
A method as claimed in claim 2, wherein the second post-multiplication factor for each difference or combination result is a trigonometric function of the transform coefficient sequence position of the coefficients used in the difference or combination.
A method as claimed in any one of claims 1 to 5, wherein the pre-processing operations are performed on each sample in the input sequence individually.
A method as claimed in any one of claims 1 to 6, wherein the post-processing operations are performed on each transform coefficient in the sequence individually.
A method for coding audio data including the steps of:
obtaining first and second input sequences of digital audio samples corresponding to respective first and second audio channels;

combining the first and second input sequences of digital audio samples into a single complex input sample sequence;

pre-processing the complex input sequence samples including applying a pre-multiplication factor to obtain modified complex input sequence samples;

transforming the modified complex input sequence samples into a complex transform coefficient sequence utilising a fast Fourier transform; and

post-processing the sequence of complex transform coefficients to obtain first and second sequences of audio coded frequency domain coefficients corresponding to the first and second audio channels including, for each corresponding frequency domain coefficient in the first and second sequences, selecting first and second complex transform coefficients from said sequence of complex transform coefficients, combining the first complex transform coefficient and the complex conjugate of the second complex transform coefficient for said first channel and differencing the first complex transform coefficient and the complex conjugate of the second complex transform coefficient for said second channel, and applying respective post-multiplication factors to the combination and difference to obtain said audio coded frequency domain coefficients corresponding to the first and second audio channels.
A method as claimed in claim 8, wherein the pre-multiplication factor for each sample in the complex input sample sequence comprises a complex trigonometric function of the complex input sample sequence position and the number of samples in the sequence.
A method as claimed in claim 8 or 9, wherein the post-processing for each of the first and second channels includes applying first post-multiplication factors to the real and imaginary coefficient components, differencing and combining the post-multiplied real and imaginary components, applying second post-multiplication factors to the difference and combination results, and differencing to obtain a sequence of modified discrete cosine transform coefficients representing said input sequence of digital audio samples.
A method for coding audio data as claimed in claim 8, including examining said first and second sequences of digital audio samples to determine a short or long transform length, and coding the audio samples using a short or long transform length as determined.
A method for coding audio data comprising sequences of digital audio samples from a plurality of audio channels, comprising determining a transform length for each of the channels, pairing the channels according to their determined transform length, and coding the audio samples of first and second channels in each pair, as defined in claim 8, according to the determined transform length.
A method for coding audio data as claimed in any preceding claim, including applying a windowing function in combination with said step of applying a pre-multiplication factor.
A method for coding audio data including the steps of:
obtaining first and second input sequences of digital audio samples x[n], y[n] corresponding to respective first and second audio channels;

combining the first and second input sequences of digital audio samples into a single complex input sample sequence z[n], where z[n] = x[n] + jy[n];

pre-processing the complex input sequence samples including applying a pre-multiplication factor cos(πn/N) + jsin(πn/N) to obtain modified complex input sequence samples, where N is the number of audio samples in each of the first and second input sequences and n = 0,....,(N-1);

transforming the modified complex input sequence samples into a complex transform coefficient sequence Z_k utilising a fast Fourier transform, wherein k = 0,....,(N/2-1); and

post-processing the sequence of complex transform coefficients to obtain first and second sequences of audio coded frequency domain coefficients X_k, Y_k corresponding to the first and second audio channels according to:
$G_{k} = (Z_{k} + Z_{N-k-1}^{*})/2, \quad k = 0, \ldots, N/2-1$
$G'_{k} = (Z_{k} - Z_{N-k-1}^{*})/(2j), \quad k = 0, \ldots, N/2-1$
$X_{k} = \cos\gamma \, \left(g_{k,r}\cos(\pi(k+1/2)/N) - g_{k,i}\sin(\pi(k+1/2)/N)\right) - \sin\gamma \, \left(g_{k,r}\sin(\pi(k+1/2)/N) + g_{k,i}\cos(\pi(k+1/2)/N)\right)$
$Y_{k} = \cos\gamma \, \left(g'_{k,r}\cos(\pi(k+1/2)/N) - g'_{k,i}\sin(\pi(k+1/2)/N)\right) - \sin\gamma \, \left(g'_{k,r}\sin(\pi(k+1/2)/N) + g'_{k,i}\cos(\pi(k+1/2)/N)\right)$
where G _k is a transform coefficient sequence for the first channel;
G' _k is a transform coefficient sequence for the second channel;
g _k,r and g _k,i are the real and imaginary transform coefficient components of G _k;
g' _k,r and g' _k,i are the real and imaginary transform coefficient components of G' _k;
Z ^* _N-k-1 is the complex conjugate of Z _N-k-1; and
γ(k) = π(2k+1)/4.