WO2011147950A1 - Low-delay unified speech and audio codec - Google Patents


Info

Publication number
WO2011147950A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
celp
coding mode
frames
portions
Prior art date
Application number
PCT/EP2011/058701
Other languages
French (fr)
Inventor
Ralf Geiger
Markus Schnell
Guillaume Fuchs
Emmanuel Ravelli
Tom BÄCKSTRÖM
Jérémie Lecomte
Konstantin Schmidt
Nikolaus Rettelbach
Manfred Lutzky
Bernhard Grill
Original Assignee
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Application filed by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. filed Critical Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Publication of WO2011147950A1 publication Critical patent/WO2011147950A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L 19/16 - Vocoder architecture
    • G10L 19/18 - Vocoders using multiple modes
    • G10L 19/20 - Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding

Definitions

  • the present invention is concerned with a unified speech and audio codec, such as, for example, one for coding signals composed of both speech and music or other combinations of audio contributions of different types with a time-varying ratio among these contributions.
  • the present invention is concerned with a low-delay solution.
  • a multi-mode audio encoder may take advantage of changing the encoding mode over time corresponding to the change of the audio content type.
  • the multi-mode audio encoder may decide, for example, to encode portions of the audio signal having speech content using a coding mode especially dedicated for coding speech, and to use another coding mode in order to encode different portions of the audio content representing non-speech content such as music.
  • Codebook excitation linear prediction coding modes tend to be more suitable for coding speech contents, whereas transform coded excitation linear prediction coding modes tend to outperform codebook excitation linear prediction coding modes as far as the coding of music is concerned, for example.
  • USAC, for example, suggests switching between a frequency domain coding mode largely complying with the AAC standard, and two further linear prediction modes similar to sub-frame modes of the AMR-WB+ standard, namely a TCX mode and an ACELP mode.
  • a certain framing structure is used in order to switch between FD domain and LP domain.
  • the AMR-WB+ standard itself uses its own framing structure, forming a sub-framing structure relative to the USAC standard.
  • the AMR-WB+ standard allows for certain subdivision configurations sub-dividing the AMR-WB+ frames into smaller TCX and/or ACELP frames.
  • the AAC standard uses a fixed framing structure as a basis, but allows for the use of different window lengths in order to transform code the frame content. For example, either one long window with an associated long transform length may be used, or eight short windows with associated transforms of shorter length.
  • some audio codecs have been especially designed for low-delay applications.
  • For example, two-way communications such as via telephone or the like necessitate a low coding delay in order to avoid unpleasant waiting times during communication.
  • the AAC-ELD for example, has been especially dedicated for these types of applications.
  • AAC-ELD is a pure frequency domain coding mode, and accordingly, AAC-ELD is not optimally designed for coding of mixed signals, i.e. audio signals unifying audio portions of different types.
  • a unified speech and audio decoder comprises a frame buffer configured to buffer a sub-part of the datastream composed of consecutive frames in units of the frames so that the subpart continuously comprises at least one frame, each frame representing a coded version of a respective portion of consecutive portions of an audio signal, and each frame comprising a mode identifier assigning the respective frame to a respective one of a plurality of coding modes comprising a CELP (codebook excitation linear prediction) coding mode and a transform coded excitation linear prediction coding mode.
  • the unified speech and audio decoder comprises a CELP decoder configured to decode the frames to which the CELP coding mode is assigned to reconstruct the respective portions, and a transform coded excitation linear prediction decoder configured to decode the frames to which the transform coded excitation linear prediction coding mode is assigned, to reconstruct the respective portions, wherein the frame buffer is configured to distribute the frames buffered to the CELP decoder and the transform coded excitation linear prediction decoder under removal of the respective frames from the frame buffer, frame-wise.
  • embodiments of the present invention provide a unified speech and audio encoder comprising a mode switch configured to assign to each of consecutive portions of an audio signal a respective one of a plurality of coding modes comprising a CELP coding mode and a transform coded excitation linear prediction coding mode, a CELP encoder configured to encode the portions to which the CELP coding mode is assigned to obtain CELP frames, and a transform coded excitation linear prediction encoder configured to encode the portions to which the transform coded excitation linear prediction mode is assigned to obtain transform coded frames, wherein the unified speech and audio encoder is configured such that each frame comprises a mode identifier identifying the CELP coding mode in case of the respective frame being a CELP frame, and the transform coded excitation linear prediction coding mode in case of the respective frame being a transform coded frame.
  • the construction of the codec by combining two linear prediction coding modes, while concurrently performing the coding mode assignment in units of the frames by providing each frame with a mode identifier identifying or indicating the mode assigned to the respective frame, enables the achievement of a superior compromise between coding efficiency, despite the coexistence of speech and non-speech portions, on the one hand, and low delay on the other hand.
  • the length of the transform coded frames is restricted to the length of the CELP frames, i.e. both frame lengths are equal to each other. This tends to lower the coding efficiency for portions of the audio signal which are non-speech and of high tonality, because the achievable transform length scales with the frame length of the transform coded frames.
  • the coding efficiency loss resulting therefrom is negligible compared to the gain in coding delay reduction resulting from the restriction.
  • Fig. 1 shows a block diagram of a unified speech and audio encoder according to an embodiment.
  • Fig. 2 shows a block diagram of a unified speech and audio decoder according to an embodiment.
  • Fig. 1 shows a unified speech and audio encoder according to an embodiment of the present invention.
  • the unified speech and audio encoder 10 of Fig. 1 comprises a mode switch 12, a CELP encoder 14 and a transform coded excitation linear prediction (i.e. TCX) encoder 16.
  • the encoder may comprise a bandwidth extension module 18.
  • the mode switch 12 has an input connected to an input 22 of encoder 10 for receiving the audio signal 24 to be encoded. If present, the bandwidth extension module 18 is connected between input 22 and the mode switch 12 input.
  • Mode switch 12 has two outputs which are connected to inputs of CELP encoder 14 and TCX encoder 16, respectively.
  • CELP encoder 14, TCX encoder 16 and, if present, bandwidth extension module 18 are connected via multiplexer 20 to an output 26 of encoder 10.
  • the unified speech and audio encoder of Fig. 1 is for encoding the audio signal 24 entering at input 22 at low coding delay and such that the coding efficiency remains high even if the type of audio signal entering at input 22 changes from non-speech audio type to speech and vice versa.
  • the unified speech and audio encoder supports two coding modes, namely two linear prediction (LP) coding modes: a TCX (transform coded excitation) mode and a CELP (codebook excitation linear prediction) mode.
  • in both the TCX and CELP coding modes, the audio content is subject to linear prediction analysis in order to obtain linear prediction coefficients, and these linear prediction coefficients are transmitted within the bitstream along with an excitation signal which, when filtered with a corresponding linear prediction synthesis filter using the linear prediction coefficients within the bitstream, yields the decoded representation of the audio content.
  • CELP encoder 14 and TCX encoder 16 may share an LP analyzer 28 to this end, the LP analyzer 28 being connected to multiplexer 20 in order to forward the information concerning the linear prediction coefficients to the decoding side, as will be outlined in more detail below.
  • the TCX encoder 16 is responsible for the TCX mode.
  • in the TCX mode, the just-mentioned excitation signal is transform coded.
  • in the CELP mode, the excitation signal is coded by indexing entries within a codebook or otherwise synthetically constructing a codebook vector of samples to be filtered with the afore-mentioned synthesis filter.
  • a specific type of CELP coding may be implemented within encoder 14, such as ACELP (algebraic codebook excitation linear prediction), according to which the excitation is composed of an adaptive codebook excitation and an innovation codebook excitation.
  • the TCX mode may be implemented such that the linear prediction coefficients are exploited at the decoder side directly in the frequency domain for shaping the noise quantization by deducing scale factors.
  • TCX is set to recover the excitation signal in the transform domain from the data stream, converting the LPC coefficients into frequency shaping information and applying same onto the excitation signal in the frequency domain directly, instead of first transforming the excitation signal into the time domain and then applying the synthesis filter based on the LPC filter coefficients.
  • the audio encoder 10 is able to switch on/off additional sub-coding options such as the bandwidth extension option supported by bandwidth extension module 18.
  • mode switch 12 is configured to assign to each of consecutive portions 30a, 30b and 30c of the audio signal 24 a respective one of the afore-mentioned coding modes, namely the TCX mode and the CELP coding mode.
  • each portion 30a, 30b and 30c may be of equal length, either measured in time t or in number of samples, irrespective of the coding mode assigned thereto. Additionally or alternatively, portions 30a, 30b, 30c may be non-overlapping, although the transform lengths used to code the TCX coded portions may extend beyond these portions into preceding and succeeding portions, respectively, as will be outlined below. Insofar, the length of the TCX portions among portions 30a-c may be defined by the transform window length used to transform code same, minus the length of the aliasing cancellation portions of these windows, divided by two. As far as the CELP portions are concerned, their extent may be determined so as to define the portion of the signal 24 which they encode.
  • the audio signal 24 may be sampled at a certain sampling rate, and the portions 30a to 30c may cover immediately consecutive portions of the audio signal 24 equal in time and number of samples, respectively.
  • Mode switch 12 is configured to perform the mode assignment, for example, based on some cost measure optimization, the cost measure, for example, combining coding rate and quality.
  • coding mode switch 12 is configured to assign the various portions 30a to 30c of the audio signal 24 to any of the two coding modes. For each portion 30a to 30c, mode switch 12 may be free to choose among both coding modes, independent of the assignment of preceding portions which have previously been subject to the assignment.
  • Mode switch 12 forwards portions to which the CELP coding mode has been assigned to CELP encoder 14, and portions to which the TCX coding mode has been assigned to TCX encoder 16.
  • the assignment performed by mode switch 12 might be the result of a cooperation between the encoders 14 and 16 and mode switch 12.
  • encoders 14 and 16 may perform trials on each frame 30a to 30c so that these trials may be evaluated by mode switch 12 in order to decide on the coding mode to be eventually used.
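The trial-based decision just described can be sketched as follows. This is an illustrative Python fragment, not taken from the patent: the function name, the cost weighting `lam` and the trial numbers are all assumptions, shown only to make the rate/distortion trade-off concrete.

```python
# Hypothetical sketch of a trial-based mode decision: both encoders
# encode the current portion, and the mode switch picks the coding mode
# minimizing a combined rate/distortion cost. The lambda weighting is
# an illustrative choice, not prescribed by the patent.

def choose_mode(trials, lam=0.1):
    """trials: dict mapping mode name -> (distortion, bits).
    Returns the mode with minimal cost = distortion + lam * bits."""
    return min(trials, key=lambda m: trials[m][0] + lam * trials[m][1])

# Example trial results for one portion (made-up numbers):
trials = {
    "CELP": (0.8, 120),   # lower bit demand on speech-like content
    "TCX":  (0.5, 200),   # lower distortion on tonal content
}
mode = choose_mode(trials)
```

Raising `lam` biases the decision toward the cheaper mode in bits; lowering it favors the lower-distortion mode.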
  • a transition from one coding mode to the other between one portion and the immediately following portion may result in the mode switch 12 forwarding both portions to both encoders 14 and 16, or sub-parts thereof, in order to allow for special aliasing cancellation measures to be performed.
  • the CELP encoder 14 is configured to encode the portions to which the CELP coding mode has been assigned to obtain CELP frames.
  • CELP encoder 14 forwards the information underlying the CELP frames to the multiplexer 20 which, in turn, inserts same into the data stream output at output 26.
  • TCX encoder 16 is configured to encode the portions to which the TCX mode has been assigned in order to obtain TCX frames, and forwards the information underlying same to multiplexer 20 for insertion into data stream 32.
  • Both encoders 14 and 16 are configured such that each frame 34a, 34b and 34c of the data stream 32 comprises a mode identifier indicating the mode of the respective frame.
  • the resulting data stream 32 at output 26 comprises one frame 34a, 34b and 34c per portion 30a to 30c of the audio signal 24.
  • the frame lengths of frames 34a to 34c, measured, for example, in bits, need not be equal to each other. Rather, the frames 34a to 34c may vary in length.
  • a linear prediction analysis is continuously performed on the consecutive portions 30a to 30c of the audio signal 24.
  • an LP analyzer 28 shared by both encoders 14 and 16 may be responsible for the LP analysis.
  • the LP analyzer 28 may be configured to analyze the audio content within a current portion in order to determine linear prediction filter coefficients.
  • LP analyzer 28 may generate linear prediction filter coefficients for each of the portions 30a to 30c.
  • the linear prediction filter coefficients are then used by encoders 14 and 16 in order to perform the respective encoding as will be outlined in more detail below.
  • LP analyzer 28 may be configured to determine the linear prediction coefficients for the incoming portions 30a to 30c by use of, for example, an auto-correlation or covariance method.
  • LP analyzer 28 may produce an auto-correlation matrix and obtain the LPC coefficients therefrom using a Levinson-Durbin algorithm.
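The auto-correlation method and Levinson-Durbin recursion just mentioned can be sketched in a few lines. This is a generic textbook formulation in plain Python (function names are illustrative), not the patent's actual implementation, and omits refinements such as lag windowing.

```python
# Compute the autocorrelation of a (windowed) portion and solve the
# Toeplitz normal equations with the Levinson-Durbin recursion.

def autocorr(x, order):
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Returns (a, err): a[0..order-1] are LPC coefficients such that the
    prediction error filter is A(z) = 1 + sum_k a[k-1] z^-k, and err is
    the final prediction error energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]  # update previous coefficients
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err
```

For an ideal AR(1) process with coefficient 0.9, the normalized autocorrelation is r[k] = 0.9**k, and the recursion recovers a ≈ [-0.9, 0.0].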
  • the LPC coefficients define a synthesis filter which roughly models the human vocal tract and which, when driven by an excitation signal, essentially models the flow of air through the vocal cords.
  • This synthesis filter is modeled using linear prediction by LP analyzer 28.
  • the rate at which the shape of the human vocal tract changes is limited. Accordingly, the LP analyzer 28 may use an update rate for the LPC coefficients adapted to this limitation and differing from the portion rate of portions 30a to 30c.
  • supporting side information on the LPC coefficients used may be transmitted to the decoding side via multiplexer 20 at a rate lower than the update rate.
  • the transmission rate may equal the portion rate of portions 30a to 30c.
  • the update rate may be greater than the portion rate, and the transmission rate for LPC side information may be between the update rate (inclusively), and the portion rate (also inclusively).
  • a granularity or update rate greater than the frame/portion rate is achievable by, for example, interpolating between the LPC coefficients transmitted in the data stream per frame/portion. For example, each portion could be sub-divided into four sub-frames, each of length 64 samples in the case of 256-sample portions.
  • temporal interpolation between the supporting LPC coefficient information may be used at the encoding and decoding side in order to fill the gap between the supporting times.
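The sub-frame interpolation described above might look as follows. The sketch assumes the coefficients are interpolated in an LSF-like domain, which is a common choice but an assumption here, and mirrors the four-subframe, 256-sample example given in the text; names are illustrative.

```python
# Linearly interpolate between the LPC information of the previous and
# current supporting times to obtain one coefficient vector per subframe.

def interpolate_lsf(prev_lsf, curr_lsf, n_sub=4):
    """prev_lsf/curr_lsf: coefficient vectors at consecutive supporting
    times. Returns n_sub interpolated vectors, the last one equal to
    curr_lsf."""
    out = []
    for s in range(n_sub):
        w = (s + 1) / n_sub   # weight of the current supporting time
        out.append([(1 - w) * p + w * c for p, c in zip(prev_lsf, curr_lsf)])
    return out
```

Both encoder and decoder run the same interpolation, so no extra side information is needed beyond the supporting-time coefficients.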
  • LP analyzer 28 transmits information on the LPC coefficients to multiplexer 20 for insertion into the data stream 32. This information may represent the quantized linear prediction coefficients in an appropriate domain, such as a line spectral pair domain or the like. Even the quantization of the linear prediction coefficients may be performed in this domain.
  • LP analyzer 28 may determine the LPC coefficients at an update rate greater than the rate at which the LPC coefficients are actually transmitted and reconstructed at the decoding side.
  • the latter update rate may be achieved, for example, by interpolation between the LPC transmission supporting times and may even be higher than the portion rate, while the LPC transmission supporting times may occur at the portion rate.
  • the decoding side only has access to the quantized LPC coefficients, and accordingly, the afore-mentioned filters defined by the corresponding reconstructed linear predictions are denoted by H(z), A(z) and W(z).
  • the LP analyzer 28 thus defines an LP synthesis filter H(z) (and its reconstructed counterpart, respectively) which, when applied to a respective excitation, recovers or reconstructs the original audio content apart from some post-processing, which, however, is not considered here for ease of explanation.
  • CELP encoder 14 and TCX encoder 16 are for defining, or determining an approximate of, this excitation and transmitting respective information thereon to the decoding side via multiplexer 20 and data stream 32, respectively.
  • TCX encoder 16 may be configured to generate a spectral representation of the current TCX portion by use of window-based time-to-spectral transformation, such as an MDCT, weighting the spectral representation in accordance with the linear prediction filter coefficients for the current portion and coding the weighted spectral representation into the respective frame of the data stream 32, associated with the current portion.
  • TCX encoder 16 may subject the incoming signal 24 at the current portion to which the TCX mode has been assigned, or a pre-emphasized version thereof (pre-emphasized by use of the above-mentioned pre-emphasis filter, for example), to an MDCT transformation using, for example, some overlap with the preceding and/or succeeding portions.
  • the window 50 used for windowing and transforming the current portion (e.g. 30b) to the spectral domain in TCX encoder 16 may overlap with the succeeding frame (e.g. 30c) and/or preceding frame (e.g. 30a).
  • the window function 50 used for windowing before the actual transformation may comprise zero portions 52₁,₂ at a beginning and an end thereof, and aliasing cancellation portions 54₁,₂ at the leading and trailing edges of the current portion, so as to coincide with the aliasing cancellation portions of a preceding or succeeding TCX portion (e.g. 30a, 30c).
  • the window function 50 may be defined to not include the zero portions 52₁,₂. However, an alternative interpretation is also possible.
  • the spectral coefficients of the resulting spectral representation i.e. a transform, such as a DCT, of the whole window 50 defining the transform length 56, may then be subject to a spectral weighting using the LPC coefficients as received from LP analyzer 28.
  • the LPC coefficients are transferred to spectral weighting coefficients such that a resulting spectral formation corresponds to the analysis filter transfer function or the perceptually weighted analysis filter transfer function, the perceptual weighting being performed by the aforementioned perceptual weighting filter, for example.
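The conversion of LPC coefficients into spectral weights described above can be sketched as follows. Evaluating A(z) on the unit circle with a plain DFT sum is an illustrative simplification (real codecs use odd-frequency evaluation and decimation into scale-factor bands); the function names and bin count are assumptions.

```python
import cmath, math

# Evaluate the LP analysis filter A(z) = 1 + sum_k a[k] z^-k on the unit
# circle and use |A| as a per-bin spectral weight, so that the weighted
# spectrum corresponds to the analysis-filtered (whitened) signal.

def lpc_to_weights(a, n_bins=32):
    """Return |A(e^{j w_b})| at n_bins frequencies in [0, pi)."""
    coeffs = [1.0] + list(a)
    weights = []
    for b in range(n_bins):
        w = math.pi * b / n_bins
        val = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(coeffs))
        weights.append(abs(val))
    return weights

def weight_spectrum(spectrum, a):
    """Weight MDCT-like coefficients by the LPC-derived envelope."""
    weights = lpc_to_weights(a, n_bins=len(spectrum))
    return [x * g for x, g in zip(spectrum, weights)]
```

The decoder applies the inverse weights, so the quantization noise introduced in between is shaped by the LPC envelope, which is the point of this construction.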
  • TCX encoder 16 causes a minor delay due to the overlap of the window function 50 with preceding and succeeding portions, but this delay may be reduced by use of low-delay window functions, the non-zero portions of which overlap with the preceding/succeeding portions at merely a fraction of the TCX portion length.
  • the fraction may, for example, be equal to or smaller than one fourth of the length of the portion 30b.
  • the non-zero portion of the window used may extend into the preceding and/or succeeding portions of the current portion merely at a length shorter than, or equal to, one fourth of the TCX portion length of the current portion.
  • a 50 % overlap between the window functions may also be used.
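A restricted-overlap window of the kind described above can be constructed as follows. The sine-shaped flanks and the exact placement of the zero regions are assumptions for illustration; the patent does not fix a particular window shape. With `overlap_frac=0.5` the construction degenerates toward the fuller-overlap alternative also mentioned.

```python
import math

# Build a window of length 2*L (hop size L) whose overlap with the
# neighboring portions is restricted to overlap_frac * L samples per
# side: zeros, a sine rise, a flat "ones" region, a sine fall, zeros.
# Consecutive windows are power-complementary in the overlap region.

def low_delay_window(L, overlap_frac=0.25):
    ov = int(L * overlap_frac)        # overlap samples per side
    z = (L - ov) // 2                 # zero samples per side
    rise = [math.sin(math.pi / 2 * (i + 0.5) / ov) for i in range(ov)]
    fall = list(reversed(rise))       # equals the cosine flank
    return [0.0] * z + rise + [1.0] * (L - ov) + fall + [0.0] * z
```

For L = 256 and a quarter overlap, the rise occupies samples 96..159 and the fall 352..415; shifting by the hop L aligns fall over rise, and sin² + cos² = 1 gives the overlap-add property.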
  • CELP encoder 14 is configured to encode the current excitation of the current portion to which the CELP coding is assigned, by using codebook indices.
  • CELP encoder 14 may be configured to approximate the current excitation by a combination of an adaptive codebook excitation and an innovation codebook excitation, and to transmit the codebook indices yielding this approximation to the decoding side via multiplexer 20.
  • CELP encoder 14 is configured to construct the adaptive codebook excitation for a current frame so as to be defined by a past excitation, i.e. the excitation used for a previously encoded CELP portion, for example, and an adaptive codebook parameter for the current CELP portion which somehow modifies the past excitation in order to yield the current adaptive codebook excitation.
  • the adaptive codebook parameter may define a pitch lag and a gain prescribing how to modify the past excitation.
  • the CELP encoder 14 encodes the adaptive codebook parameter into the bitstream 32 by forwarding same to multiplexer 20. Further, CELP encoder 14 may construct the innovation codebook excitation defined by an innovation codebook index for the current portion and encode the innovation codebook index into the data stream 32 by forwarding same to multiplexer 20 for insertion into the data stream 32 and respective frame 34a to 34c, respectively. In particular, CELP encoder 14 may be configured to determine the innovation codebook index along with a respective innovation codebook gain, and forward same for insertion into the data stream.
  • both adaptive codebook parameter and innovation codebook excitation, and/or both gain values may be integrated into one common syntax element and commonly coded into the respective frame of the data stream 32. Together, same enable the decoder to recover the approximation of the current excitation thus determined by CELP encoder 14.
  • the adaptive codebook may be defined in the data stream by pitch lag and gain, while the innovation codebook is signalled to the decoding side via information concerning a codebook index and a gain of the innovation codebook, wherein both gain values may be coded commonly.
  • in order to guarantee the synchronization of the internal states of encoder and decoder, CELP encoder 14 not only determines the syntax elements enabling the decoder to recover the current codebook excitation, but also actually updates its own state by generating that excitation, so as to use the thus obtained current codebook excitation, i.e. the approximation of the actual current excitation, as the starting point, i.e. the past excitation, for encoding the next CELP portion.
  • the CELP encoder 14 may be configured to, in constructing the adaptive codebook excitation and the innovation codebook excitation, minimize a perceptually weighted distortion measure relative to the audio content of the current portion, considering that the resulting excitation is subject to LP synthesis filtering at the decoding side for reconstruction.
  • a codebook index could index certain tables available at the encoder as well as the decoding side in order to index or otherwise determine vectors serving as an excitation input of the LP synthesis filter.
  • the innovation codebook excitation is determined independently of the past excitation.
  • CELP encoder 14 may be configured to determine the adaptive codebook excitation for the current CELP portion using the past and reconstructed excitation of the previously coded CELP portion by modifying the latter using a certain delay and gain value and a predetermined (interpolation) filtering, so that the resulting adaptive codebook excitation of the current portion minimizes a difference to a certain target for the adaptive codebook excitation recovering, when filtered by the synthesis filter, the original audio content.
  • the just-mentioned delay, gain and filtering is indicated by adaptive codebook parameters.
  • the remaining discrepancy is compensated by the innovation codebook excitation.
  • CELP encoder 14 suitably sets the innovation codebook index to find an optimum innovation codebook excitation which, when combined with (such as added to) the adaptive codebook excitation of the current portion, yields the current excitation for the current portion with the latter serving as the past excitation when constructing the adaptive codebook excitation of the following CELP portion.
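The combination of the two excitation contributions and the state update described above can be sketched as follows. Integer pitch lag, the absence of the interpolation filter, and the toy vectors are simplifying assumptions; function and parameter names are illustrative, not from the patent.

```python
# The adaptive contribution is the past excitation read back at the
# pitch lag and scaled by the pitch gain (repeating at the lag when the
# lag is shorter than the subframe); the innovation contribution is a
# codebook vector with its own gain. Their sum is the current
# excitation, which becomes the new "past excitation" state.

def celp_excitation(past, pitch_lag, pitch_gain, code_vec, code_gain):
    n = len(code_vec)
    buf = list(past)
    for _ in range(n):                       # long-term predictor
        buf.append(buf[-pitch_lag])
    adaptive = [pitch_gain * v for v in buf[len(past):]]
    exc = [a + code_gain * c for a, c in zip(adaptive, code_vec)]
    state = (list(past) + exc)[-len(past):]  # encoder/decoder state update
    return exc, state
```

Because both sides perform the identical update, the decoder's adaptive codebook stays synchronized with the encoder's, which is why the encoder must actually generate the excitation rather than only its parameters.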
  • encoder 10 may optionally comprise a bandwidth extension module.
  • This bandwidth extension module 18 may be configured to generate bandwidth extension side information for the portions 30a to 30c and insert the respective bandwidth extension information into the data stream, frame-wise, via multiplexer 20.
  • bandwidth extension module 18 is optional and may, accordingly, not be present.
  • the encoder 10 may be switchable so as to switch on or off the operation of bandwidth extension module 18. If operative, bandwidth extension module 18 may operate as follows. First, bandwidth extension module 18 may operate on the original audio signal 24 and forward, for example, merely a band-limited portion thereof further on to mode switch 12.
  • bandwidth extension module 18 may operate on the audio signal 24 at the full sampling rate, whereas mode switch 12 merely receives the audio signal 24 at half the sampling rate, or at a sampling rate having another proper fractional ratio relative to the original sampling rate at which bandwidth extension module 18 performs the bandwidth extension coding.
  • Bandwidth extension module 18 may, for example, perform a spectral analysis of the inbound audio signal 24 by use of, for example, an analysis filter bank. Using this analysis filter bank, the bandwidth extension module 18 may obtain a temporal/spectral sampling of the audio signal 24 at a spectral/temporal grid having a temporal resolution higher than the portion rate of portions 30a to 30c. See, for example, the illustrative dashed grid 70 in Fig.
  • bandwidth extension module 18 may use transform windows and an MDCT transform, a QMF filterbank as used in SBR according to HE-AAC, or a CLDFB (Complex Low Delay Filterbank) as used in LowDelay SBR according to AAC-ELD.
  • the bandwidth extension module 18 then analyzes the spectral envelope of the spectrogram within a high frequency portion 72 of the audio signal 24, i.e. the spectral components of the audio signal 24 not forwarded to mode switch 12.
  • Bandwidth extension module 18 may determine the spectral envelope by determining the energy of the spectrogram within spectral/temporal tiles of a spectral/temporal grid, which is coarser than the spectral/temporal grid provided by the afore-mentioned analysis filterbank, such as by summing-up the squares of the spectral coefficients 76 within these tiles. Based on this spectral envelope, the bandwidth extension module 18 determines SBR data which is sent via multiplexer 20 to the decoding side.
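The coarse envelope computation just described, summing squared spectral coefficients within time/frequency tiles, can be sketched as follows; the tile sizes and the nested-list spectrogram representation are illustrative choices.

```python
# Reduce a fine time/frequency spectrogram to a coarse grid of tile
# energies: each tile energy is the sum of the squared spectral
# coefficients falling into that tile.

def envelope(spectrogram, band_size, time_size):
    """spectrogram: list of spectra, one per analysis time slot.
    Returns a coarser grid of energies, rows = coarse time slots."""
    n_t, n_f = len(spectrogram), len(spectrogram[0])
    grid = []
    for t0 in range(0, n_t, time_size):
        row = []
        for f0 in range(0, n_f, band_size):
            e = sum(spectrogram[t][f] ** 2
                    for t in range(t0, min(t0 + time_size, n_t))
                    for f in range(f0, min(f0 + band_size, n_f)))
            row.append(e)
        grid.append(row)
    return grid
```

Only this coarse grid (quantized and coded as SBR data) needs to be transmitted; the decoder shapes the replicated low band to match these tile energies.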
  • the high frequency portion may be reconstructed based on the SBR data by appropriately replicating (or otherwise transposing) the low frequency portion 78 of the reconstructed audio signal as obtained by decoding the CELP and TCX frames output by encoders 14 and 16, in order to obtain a finely varying high frequency pre-filling spectrum, and spectrally forming the latter in accordance with the spectral envelope defined by the SBR data.
  • for further details regarding SBR, reference is made to the AAC-ELD standard.
  • blind bandwidth extension as known from the AMR-WB standard may be used in order to extend the bandwidth reconstructible from the frames output by CELP encoder 14 and TCX encoder 16, respectively, to a higher frequency portion at the decoding side.
  • each frame 34a to 34c may have the following information incorporated therein: 1) a mode identifier indicating whether the current frame is associated with a portion 30a to 30c encoded using the CELP mode or TCX mode, respectively; 2) LPC coefficient data pertaining to the associated portion 30a-c; as mentioned above, the LPC update rate may even be higher than the portion rate so that, for example, the LPC coefficients defined by the LPC coefficient data may change several times within the associated portion by way of interpolation in encoder and decoder; 3) bandwidth extension data such as SBR data, assisting the decoder in extending the bandwidth of the current frame compared to the bandwidth 78 as obtained from the mode-specific data 4); in particular, the SBR data may cover, i.e. comprise information relating to, the envelope of the high frequency portion 72 within the temporal interval associated with the current portion 30a-30c;
  • 4) mode-specific data, such as, in the case of a CELP frame, a codebook index (such as the innovation codebook index) besides other data enabling the reconstruction of the current excitation signal based on the past excitation signal, such as the adaptive codebook parameter and an energy or loudness related syntax element.
  • the encoder 10 is able to provide for a low coding delay at good coding efficiency even in case of audio signals of unspecified type, i.e. speech or non-speech.
  • the low coding delay will also become apparent from the following description of a possible audio decoder.
  • Fig. 2 describes a unified speech and audio decoder 100 able to decode data streams generated by the encoder of Fig. 1 to reconstruct the original audio signal.
  • the decoder 100 comprises a frame buffer 102, a CELP decoder 104 and a TCX decoder 106.
  • Frame buffer 102 is connected between an input 108 of decoder 100 and respective inputs of decoders 104, 106, respectively.
  • Respective outputs of decoders 104 and 106 are connected to a recombiner 110.
  • optionally, decoder 100 comprises a bandwidth extension module 112, with recombiner 110 being connected to output 114 of decoder 100 either directly or via the optional bandwidth extension module 112.
  • the mode of operation of decoder 100 is as follows.
  • the data stream 32 as generated by the encoder of Fig. 1 enters at input 108.
  • the data stream 32 comprises consecutive frames 34a to 34c which, as illustrated in Figs. 1 and 2, may be arranged within the data stream 32 as self-contained or continuous portions of the data stream, respectively.
  • another arrangement within the data stream 32 would also be feasible.
  • the frame buffer 102 is responsible for buffering frames 34a to 34c for being operated on in modules 104, 106, 110 and 112, respectively.
  • the frame buffer 102 is configured to buffer the data stream 32 in units of these frames 34a to 34c and to distribute the frames buffered to the CELP decoder 104 and the TCX decoder 106, respectively, under removal of the respective frames from the buffer, frame-wise. That is, the occupied storage space within frame buffer 102 increases and decreases in units of frames, respectively, and the available storage space may be configured to guarantee, for example, the accommodation of at least one frame.
  • frame buffer 102 may be configured to buffer a sub-part of the data stream in units of the frames so that the buffered sub-part continuously comprises at least one frame, namely the one currently to be decoded.
  • the frame buffer may have an available storage space accommodating more than one frame at a time.
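The frame-wise buffering and distribution behavior just described may be sketched as follows; the dispatch key and dictionary layout are assumptions for illustration, not part of the codec:

```python
from collections import deque

class FrameBuffer:
    """Buffers the data stream in units of frames and hands each frame
    to the matching decoder, removing it from the buffer frame-wise."""
    def __init__(self):
        self._frames = deque()

    def push(self, frame):
        # occupied storage grows in units of whole frames
        self._frames.append(frame)

    def distribute(self, celp_decoder, tcx_decoder):
        # removal is frame-wise: the frame leaves the buffer on dispatch
        frame = self._frames.popleft()
        (celp_decoder if frame["mode"] == "CELP" else tcx_decoder)(frame)

buf = FrameBuffer()
decoded = []
buf.push({"mode": "CELP"})
buf.distribute(decoded.append, lambda f: None)
```

After `distribute`, the buffer is empty again, matching the description that storage occupancy increases and decreases in units of frames.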
  • each frame comprises a mode identifier assigning the respective frame 34a to 34c to a respective one of a plurality of coding modes comprising the CELP coding mode and the transform coded excitation LP coding mode.
  • the CELP decoder is configured to decode the frames 34a to 34c to which the CELP coding mode is assigned, to reconstruct the respective portions 30a to 30c of the coded/reconstructible version 116 of the original audio signal 24.
  • TCX decoder 106 is configured to decode the frames 34a to 34c to which the TCX mode is assigned, to reconstruct the portions 30a to 30c of the reconstructed version 116, the coded version of which the respective frames represent.
  • the frame buffer removes a frame currently to be decoded from its internal storage and distributes the information contained therein to the respective recipients.
  • this demultiplexing function may be performed by an extra demultiplexer which could be positioned between frame buffer 102 on the one hand and modules 104, 106, 110 and 112 on the other hand.
  • in the case of TCX frames, frame buffer 102 forwards same to TCX decoder 106.
  • frame buffer 102 provides the TCX decoder 106 with the above-described weighted spectral representation of the excitation signal.
  • frame buffer 102 forwards CELP frames to CELP decoder 104.
  • At least the codebook index is provided to CELP decoder 104.
  • frame buffer 102 may forward the bandwidth extension data contained within the frames to bandwidth extension module 112.
  • frames positioned at transitions between the TCX coding mode and the CELP coding mode may comprise additional aliasing cancellation information, and frame buffer 102 may be configured to forward this additional information to recombiner 110.
  • the bitstream comprises the information on the linear prediction filter coefficients, and frame buffer 102 forwards this information to CELP decoder 104 and TCX decoder 106.
  • as both decoders 104 and 106 are of the linear prediction type and rely on the LPC filter coefficients, both decoders may share, or jointly own, a linear prediction coefficient decoder 118.
  • this linear prediction coefficient decoder 118 obtains, for each frame 34a to 34c, the corresponding linear prediction filter coefficients.
  • decoder 118 may obtain supporting linear prediction coefficients from the data stream 32 corresponding to supporting time instants and derive the linear prediction coefficients to be used for the individual frames 34a to 34c by temporal interpolation.
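The temporal interpolation between supporting LPC coefficient sets may be sketched as follows. This is a plain linear interpolation for illustration; actual codecs typically interpolate in an LSP/ISP-like domain to guarantee filter stability:

```python
def interpolate_lpc(prev, curr, num_subframes=4):
    """Linearly interpolate between two sets of supporting LPC
    coefficients to obtain one coefficient set per subframe."""
    sets = []
    for k in range(1, num_subframes + 1):
        w = k / num_subframes  # weight ramps toward the current set
        sets.append([(1 - w) * p + w * c for p, c in zip(prev, curr)])
    return sets

# hypothetical 2-tap coefficient sets for two consecutive supporting times
subframe_lpc = interpolate_lpc([1.0, 0.5], [0.0, 0.1], num_subframes=4)
```

The last subframe coincides with the current supporting coefficients, so encoder and decoder, performing the same interpolation, stay in sync.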
  • TCX decoder 106 and CELP decoder 104 decode the frames assigned to them.
  • TCX decoder 106 may be configured to decode a frame currently to be decoded and having the TCX mode assigned thereto in the following way.
  • TCX decoder 106 may decode the weighted spectral representation from the current frame. This may, for example, include dequantization and re-scaling of the spectral coefficients of the weighted spectral representation. Then, TCX decoder 106 may perform the re-weighting of the weighted spectral representation using the linear prediction filter coefficients for the current frame as obtained by decoder 118.
  • TCX decoder 106 may turn these linear prediction filter coefficients into spectral weighting factors together defining a spectral formation in accordance with a transfer function corresponding to an (optionally perceptually weighted) LPC synthesis filter defined by the linear prediction filter coefficients.
  • TCX decoder 106 thus spectrally forms the weighted spectral representation as obtained from the data stream 32 in order to obtain a re-weighted spectral representation.
  • the re-weighted spectral representation is then retransformed into time- domain by use of a window-based spectral-to-time transformation.
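The derivation of spectral weighting factors from the LPC coefficients mentioned above may be sketched as a straightforward sampling of the LPC synthesis filter magnitude response 1/|A(z)|. This is an illustrative simplification, not the exact codec routine:

```python
import cmath
import math

def lpc_spectral_gains(lpc, num_bins):
    """Per-bin magnitude response of the LPC synthesis filter 1/A(z),
    used as spectral weighting factors for the re-weighting step."""
    gains = []
    for b in range(num_bins):
        w = math.pi * (b + 0.5) / num_bins  # bin center frequency
        A = 1.0 + sum(c * cmath.exp(-1j * w * (k + 1))
                      for k, c in enumerate(lpc))
        gains.append(1.0 / abs(A))
    return gains

def reweight(weighted_spectrum, gains):
    # spectrally form the weighted representation with the LPC gains
    return [s * g for s, g in zip(weighted_spectrum, gains)]

g = lpc_spectral_gains([-0.5], num_bins=4)   # hypothetical 1-tap LPC
rw = reweight([1.0, 1.0, 1.0, 1.0], g)
```

Applying the LPC coefficients as per-bin gains in the frequency domain is what lets the quantization noise be shaped without first returning to the time domain.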
  • assume, for example, that portion 30b is the portion associated with the current TCX frame.
  • after performance of the window based spectral-to-time transformation, TCX decoder 106 obtains a time-domain signal relating to a time portion of reconstructed signal 116 which overlaps portion 30b, which the currently decoded frame is associated with, but extends beyond that portion 30b into the subsequent portion 30c as well as the preceding portion 30a.
  • This time portion 56 may comprise the above-described aliasing cancellation portions 54a and 54b at the borders between the current frame 30b and the portions 30a and 30c of immediately preceding and succeeding frames, respectively.
  • recombiner 110 recombines, i.e. overlaps and adds, within aliasing cancellation portions 54a and 54b the time-domain signals obtained by the window based spectral-to-time transformation for the consecutive TCX frames in order to yield the actual time-domain reconstructed version of these portions.
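The overlap-and-add performed by recombiner 110 within the aliasing cancellation portions reduces to a sample-wise addition of the windowed, inverse-transformed segments; the window design guarantees that the time-domain aliasing terms cancel. A minimal sketch of the addition itself:

```python
def overlap_add(prev_tail, curr_head):
    """Add the trailing aliasing-cancellation samples of the previous
    TCX segment to the leading samples of the current one."""
    assert len(prev_tail) == len(curr_head)
    return [a + b for a, b in zip(prev_tail, curr_head)]

# hypothetical fade-out/fade-in samples of two consecutive TCX segments
out = overlap_add([0.25, 0.5], [0.75, 0.5])
```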
  • CELP decoder 104 is configured to use a codebook index comprised within a current CELP frame in order to build the excitation signal for the current frame, and apply a synthesis filter depending on the linear prediction filter coefficient for the current frame to the excitation signal so as to obtain the time-domain signal of the current CELP frame.
  • the CELP decoder 104 may use ACELP, and in this case CELP decoder 104 may retrieve an innovation codebook index along with an adaptive codebook parameter from the current frame.
  • CELP decoder 104 uses the index in order to reconstruct the adaptive codebook excitation and the innovation codebook excitation, respectively.
  • CELP decoder 104 may construct the adaptive codebook excitation by modifying/interpolating the past reconstructed excitation depending on the adaptive codebook parameter.
  • the CELP decoder 104 may combine this adaptive codebook excitation with an innovation codebook excitation in order to yield the reconstructed version of the current excitation.
  • CELP decoder 104 evaluates the innovation codebook index. Both adaptive codebook excitation and innovation codebook excitation are combined with each other by way of a weighted sum with weighting factors also being determined by CELP decoder 104 via the adaptive codebook parameter and the innovation codebook index.
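The weighted-sum combination of the adaptive and innovation codebook excitations just described may be sketched as follows; the gain values shown are illustrative, whereas in the codec they are derived from the transmitted parameters:

```python
def build_excitation(adaptive_exc, innovation_exc, g_pitch, g_code):
    """Combine adaptive and innovation codebook contributions by a
    weighted sum, as performed by a CELP/ACELP decoder."""
    return [g_pitch * a + g_code * c
            for a, c in zip(adaptive_exc, innovation_exc)]

# hypothetical 2-sample excitation vectors and gains
exc = build_excitation([1.0, -1.0], [0.5, 0.5], g_pitch=0.5, g_code=2.0)
```

The resulting excitation is then fed through the LPC synthesis filter to yield the time-domain signal of the current CELP frame.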
  • Recombiner 110 puts together the reconstructed version of consecutive CELP frames as output by CELP decoder 104.
  • recombiner 110 may be configured to perform special measures at transitions between the TCX coding mode and the CELP coding mode, respectively. In particular, recombiner 110 may evaluate additional information contained in the data stream 32 to this end.
  • the reconstructed version 116 of the original audio signal 24 results.
  • the optional bandwidth extension module 112 may extend the bandwidth of the reconstructed signal 116 as obtained by recombiner 110 into, for example, the higher frequency portion (see 72 in Fig. 1).
  • bandwidth extension module 112 may apply a spectral analysis on signal 116 by use of, for example, an analysis filterbank such as a QMF or CLDFB filterbank so as to obtain a spectrogram thereof in a spectral/temporal resolution within lower frequency region 78, the temporal component of which exceeds the portion rate of portions 30a to 30c.
  • Bandwidth extension module 112 uses this spectrogram in order to pre-fill, such as by replication, the high frequency portion 72, with then spectrally forming the pre-filled version using the SBR data forwarded by frame buffer 102 for the individual frames 34a to 34c in the grid resolution 74.
  • the bandwidth extension module 112 may then retransfer the spectrally expanded spectrogram extending over both frequency portions 78 and 72 to time-domain in order to yield the reconstruction of the audio signal.
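The replication-plus-envelope-shaping performed by the bandwidth extension module may be grossly sketched as follows. Real SBR operates on a QMF spectrogram with a transmitted time/frequency envelope grid; here both the low-band values and the envelope gains are illustrative:

```python
def sbr_prefill(low_band, envelope):
    """Pre-fill the high band by replicating the low band, then shape
    the replica with the transmitted spectral envelope gains."""
    return [gain * s for gain, s in zip(envelope, low_band)]

# hypothetical low-band magnitudes and SBR envelope gains
high_band = sbr_prefill([1.0, 0.5, 0.25], [0.5, 0.5, 2.0])
```

The envelope gains restore the coarse spectral shape of the original high band that the replication alone cannot reproduce.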
  • the audio codec described above with respect to Figs. 1 and 2, and in accordance with embodiments of the present invention, is able to provide for a high coding efficiency even in case of dealing with audio signals of differing type, such as speech and non-speech signals. Beyond that, the coding delay is low.
  • the coding delay of the above-described embodiments may be so low that same are suitable for two-way communication.
  • high music quality is obtainable as well as a speech quality comparable to that of specially dedicated speech codecs.
  • the portions 30a to 30c mentioned above may have a length of 256 samples each. At a sampling rate of 12.8 kHz this results in a frame/portion length of 20 ms.
  • the original audio signal may have a sampling rate, or the bandwidth extension module 18 may operate at a sampling rate, of double the sampling rate underlying the CELP and TCX coding, i.e. 25.6 kHz.
  • the ratio of 2:1 merely serves as an example and other ratios are feasible as well, such as 2.5:1, leading to a sampling rate of 32 kHz which the bandwidth extension module operates on.
  • other sampling rates than 12.8 kHz are also feasible in connection with the CELP and TCX coding modes.
  • the resulting delay of the above outlined embodiments might be 45 ms in total.
  • 20 ms stem from the framing structure of the frames 34a to 34c itself.
  • Another 20 ms may stem from the overlap of the window functions of the TCX coding mode. That is, the transformation length 56 may be 40 ms or 512 samples, respectively.
  • 2.5 ms may result from the window functions involved in the analysis and synthesis filterbanks involved in the bandwidth extension performing the SBR.
  • another 2.5 ms may result from additional filtering and resampling measures not described in detail above.
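The 45 ms total is simply the sum of the contributions listed above, and the 20 ms framing component follows from the 256-sample portion length at the 12.8 kHz internal sampling rate:

```python
# delay budget of the SWB configuration, in milliseconds
framing_ms = 20.0        # framing structure of frames 34a to 34c
tcx_overlap_ms = 20.0    # overlap of the TCX window functions
sbr_filterbank_ms = 2.5  # SBR analysis/synthesis filterbanks
resampling_ms = 2.5      # additional filtering and resampling

total_ms = framing_ms + tcx_overlap_ms + sbr_filterbank_ms + resampling_ms

# 256 samples per portion at 12.8 kHz give the 20 ms framing delay
framing_from_samples_ms = 256 * 1000 // 12800
```

Reducing the TCX overlap to 5-7 ms by low-overlap windows, as mentioned below, shrinks the second contribution accordingly.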
  • the 20 ms resulting from the overlap between overlapping TCX window functions may be reduced down to, for example, 5-7 ms by use of low-overlap or low-delay windows wherein the actual non-zero portion of the window 50 is smaller than the extension of the transformation length 56.
  • the configuration described so far, using SBR, may be referred to as a super-wideband (SWB) mode.
  • alternatively, a wideband (WB) mode may be used according to which the 20 ms framing structure combined with the 12.8 kHz internal sampling rate is used, however using blind bandwidth extension known from, for example, AMR-WB instead of SBR to extend the bandwidth from, for example, 6.4 kHz (cf. 78 in Fig. 1) to 7 kHz (cf. 78 and 72 in Fig. 1).
  • the resulting delay may then be reduced to 43 ms in total.
  • 20 ms stem from the framing structure itself and another 20 ms from the overlap between consecutive TCX windows, which time delay may, as just mentioned, be reduced down to 5 or 7 ms.
  • another 3 ms stem from filtering and resampling.
  • a narrow band (NB) mode may be obtained by omitting any bandwidth extension.
  • the 20 ms framing structure combined with the 12.8 kHz internal sampling rate may be used.
  • a resampling from 8 kHz to 12.8 kHz may be used in order to use the same coding kernel as for the WB mode.
  • the resulting delay is again 43 ms in total, namely the 20 ms stemming from the framing structure, another 20 ms stemming from the overlap between consecutive TCX windows, which delay may, as mentioned above, be reduced down to 5 or 7 ms by use of low-overlap or low-delay windows, and another 3 ms due to filtering and resampling.
  • although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
  • Some or all of the method steps may be executed by (or using) a hardware apparatus, like, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
  • the inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
  • embodiments of the invention can be implemented in hardware or in software.
  • the implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
  • Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
  • embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer.
  • the program code may for example be stored on a machine readable carrier.
  • further embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
  • an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
  • a further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
  • the data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
  • a further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein.
  • the data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
  • a further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
  • a further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
  • a further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver.
  • the receiver may, for example, be a computer, a mobile device, a memory device or the like.
  • the apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
  • in some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein.
  • a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein.
  • the methods are preferably performed by any hardware apparatus.

Abstract

A unified speech and audio decoder is described, which comprises a frame buffer configured to buffer a sub-part of a datastream composed of consecutive frames in units of the frames so that the subpart continuously comprises at least one frame, each frame representing a coded version of a respective portion of consecutive portions of an audio signal, and each frame comprising a mode identifier assigning the respective frame to a respective one of a plurality of coding modes comprising a CELP (codebook excitation linear prediction) coding mode and a transform coded excitation linear prediction coding mode. Further, the unified speech and audio decoder comprises a CELP decoder configured to decode the frames to which the CELP coding mode is assigned to reconstruct the respective portions, and a transform coded excitation linear prediction decoder configured to decode the frames to which the transform coded excitation linear prediction coding mode is assigned, to reconstruct the respective portions, wherein the frame buffer is configured to distribute the frames buffered to the CELP decoder and the transform coded excitation linear prediction decoder under removal of the respective frames from the frame buffer, frame-wise.

Description

Low-Delay Unified Speech and Audio Codec
Description
The present invention is concerned with a unified speech and audio codec such as, for example, coding signals composed of both speech and music or other combinations of audio contributions of different type with time-varying ratio among these contributions. In particular, the present invention is concerned with a low-delay solution.
It is favorable to mix different coding modes in order to code general audio signals representing a mix of audio signals of different types such as speech, music or the like. The individual coding modes may be adapted for particular audio types, and thus, a multi-mode audio encoder may take advantage of changing the encoding mode over time corresponding to the change of the audio content type. In other words, the multi-mode audio encoder may decide, for example, to encode portions of the audio signal having speech content using a coding mode especially dedicated for coding speech, and to use another coding mode in order to encode different portions of the audio content representing non-speech content such as music. Codebook excitation linear prediction coding modes, for example, tend to be more suitable for coding speech contents, whereas transform coded excitation linear prediction coding modes tend to outperform codebook excitation linear prediction coding modes as far as the coding of music is concerned, for example. There have already been solutions for addressing the problem of coping with the coexistence of different audio types within one audio signal. The currently emerging USAC standard, for example, suggests switching between a frequency domain coding mode largely complying with the AAC standard, and two further linear prediction modes similar to sub-frame modes of the AMR-WB+ standard, namely a TCX mode and an ACELP mode. A certain framing structure is used in order to switch between the FD domain and the LP domain. The AMR-WB+ standard itself uses its own framing structure forming a sub-framing structure relative to the USAC standard. The AMR-WB+ standard allows for certain subdivision configurations sub-dividing the AMR-WB+ frames into smaller TCX and/or ACELP frames. Similarly, the AAC standard uses a basic framing structure, but allows for the use of different window lengths in order to transform code the frame content.
For example, either a long window and an associated long transform length may be used, or eight short windows with associated transformations of shorter length. On the other hand, some audio codecs have been especially designed for low-delay applications. For example, two-way communications such as via telephone or the like, necessitate a low coding delay in order to avoid unpleasant waiting times during communication. The AAC-ELD, for example, has been especially dedicated for these types of applications. Unfortunately, AAC-ELD is a pure frequency domain coding mode, and accordingly, AAC-ELD is not optimally designed for coding of mixed signals, i.e. audio signals unifying audio portions of different types.
Thus, it is an object of the present invention to provide a unified speech and audio codec comprising both abilities, namely coping with the coexistence of speech and non-speech portions within the audio signal to be coded, and keeping the coding delay low.
This object is achieved by the subject matter of the independent claims. In accordance with an embodiment of the present invention, a unified speech and audio decoder comprises a frame buffer configured to buffer a sub-part of the datastream composed of consecutive frames in units of the frames so that the subpart continuously comprises at least one frame, each frame representing a coded version of a respective portion of consecutive portions of an audio signal, and each frame comprising a mode identifier assigning the respective frame to a respective one of a plurality of coding modes comprising a CELP (codebook excitation linear prediction) coding mode and a transform coded excitation linear prediction coding mode. Further, the unified speech and audio decoder comprises a CELP decoder configured to decode the frames to which the CELP coding mode is assigned to reconstruct the respective portions, and a transform coded excitation linear prediction decoder configured to decode the frames to which the transform coded excitation linear prediction coding mode is assigned, to reconstruct the respective portions, wherein the frame buffer is configured to distribute the frames buffered to the CELP decoder and the transform coded excitation linear prediction decoder under removal of the respective frames from the frame buffer, frame-wise.
Accordingly, embodiments of the present invention provide a unified speech and audio encoder comprising a mode switch configured to assign to each of consecutive portions of an audio signal a respective one of a plurality of coding modes comprising a CELP coding mode and a transform coded excitation linear prediction coding mode, a CELP encoder configured to encode the portions to which the CELP coding mode is assigned to obtain CELP frames, and a transform coded excitation linear prediction encoder configured to encode the portions to which the transform coded excitation linear prediction mode is assigned to obtain transform coded frames, wherein the unified speech and audio encoder is configured such that each frame comprises a mode identifier identifying the CELP coding mode in case of the respective frame being a CELP frame, and the transform coded excitation linear prediction coding mode in case of the respective frame being a transform coded frame.
The construction of the codec by combining two linear prediction coding modes with concurrently performing the coding mode assignment in units of the frames by providing each frame with a mode identifier identifying or indicating the mode assigned to the respective frame, enables the achievement of a superior compromise between coding efficiency despite the coexistence of speech and non-speech portions on the one hand and low delay on the other hand.
In accordance with an embodiment of the present invention, the length of the transform coded frames is restricted to the length of the CELP frames, i.e. both frame lengths are equal to each other. This tends to lower the coding efficiency as far as portions of the audio signals are concerned which are of non-speech type and of high tonality, because the transform length scales with the frame length of the transform coded frame. However, the coding efficiency loss resulting therefrom is negligible compared to the gain in coding delay reduction resulting from the restriction.
Preferred embodiments of the present invention are described below with respect to the figures, among which
Fig. 1 shows a block diagram of a unified speech and audio encoder according to an embodiment; and
Fig. 2 shows a block diagram of a unified speech and audio decoder according to an embodiment. Fig. 1 shows a unified speech and audio encoder according to an embodiment of the present invention. The unified speech and audio encoder 10 of Fig. 1 comprises a mode switch 12, a CELP encoder 14 and a transform coded excitation linear prediction (i.e. TCX) encoder 16. Optionally, the encoder may comprise a bandwidth extension module 18. In particular, the mode switch 12 has an input connected to an input 22 of encoder 10 for receiving the audio signal 24 to be encoded. If present, the bandwidth extension module 18 is connected between input 22 and the mode switch 12 input. Mode switch 12 has two outputs which are connected to inputs of CELP encoder 14 and TCX encoder 16, respectively. CELP encoder 14, TCX encoder 16 and, if present, bandwidth extension module 18 are connected via multiplexer 20 to an output 26 of encoder 10.
The unified speech and audio encoder of Fig. 1 is for encoding the audio signal 24 entering at input 22 at low coding delay and such that the coding efficiency remains high even if the type of audio signal entering at input 22 changes from non-speech audio type to speech and vice versa.
To this end, the unified speech and audio encoder supports two coding modes, namely two LP (linear prediction) coding modes, including a TCX (transform coded excitation) and a CELP (codebook excitation linear prediction) coding mode. In TCX and CELP coding modes, the audio content is subject to linear prediction analysis in order to obtain linear prediction coefficients, and these linear prediction coefficients are transmitted within the bitstream along with an excitation signal which, when filtered with a corresponding linear prediction synthesis filter using the linear prediction coefficients within the bitstream, yields the decoded representation of the audio content. As illustrated in Fig. 1, CELP encoder 14 and TCX encoder 16 may share an LP analyzer 28 to this end, the LP analyzer 28 being connected to multiplexer 20 in order to forward the information concerning the linear prediction coefficients to the decoding side, as will be outlined in more detail below.
The TCX encoder 16 is responsible for the TCX mode. In TCX, the just-mentioned excitation signal is transform coded, whereas in the case of CELP coding modes for which CELP encoder 14 is responsible, the excitation signal is coded by indexing entries within a codebook or otherwise synthetically constructing a codebook vector of samples to be filtered with the afore-mentioned synthesis filter. In particular, a specific type of CELP coding may be implemented within encoder 14, such as ACELP (algebraic codebook excitation linear prediction), according to which the excitation is composed of an adaptive codebook excitation and an innovation codebook excitation. As will be outlined in more detail below, the TCX mode may be implemented such that the linear prediction coefficients are exploited at the decoder side directly in the frequency domain for shaping the quantization noise by deducing scale factors. In this case, TCX is set to recover the excitation signal in the transform domain from the data stream with transferring the LPC coefficients into frequency shaping information and applying same onto the excitation signal in the frequency domain directly, instead of transforming the excitation signal into the time domain first with then applying the synthesis filter based on the LPC filter coefficients. The latter process is, however, also feasible. Besides the main coding modes described so far, the audio encoder 10 is able to switch on/off additional sub-coding options such as the bandwidth extension option supported by bandwidth extension module 18. After having described the structure of the encoder 10 of Fig. 1 rather generally along with a generic overview of the supported coding modes, the cooperation between the elements shown in Fig. 1 is described in more detail below.
In particular, mode switch 12 is configured to assign to each of consecutive portions 30a, 30b and 30c of the audio signal 24 a respective one of the afore-mentioned coding modes, namely the TCX mode and the CELP coding mode.
As shown in Fig. 1, each portion 30a, 30b and 30c may be of equal length, either measured in time t or in number of samples, irrespective of the coding mode assigned thereto. Additionally or alternatively, portions 30a, 30b, 30c may be non-overlapping, although the transform lengths used to code the TCX coded portions may extend beyond these portions into preceding and succeeding portions, respectively, as will be outlined below. Insofar, the length of the TCX portions among portions 30a-c may be defined by their transform window length used to transform code same, minus the length of the aliasing cancellation portions of these windows divided by 2. As far as the CELP portions are concerned, the extension of same may be determined to define the portion of the signal 24 which they encode.
In other words, the audio signal 24 may be sampled at a certain sampling rate, and the portions 30a to 30c may cover immediately consecutive portions of the audio signal 24 equal in time and number of samples, respectively. Mode switch 12 is configured to perform the mode assignment, for example, based on some cost measure optimization, the cost measure, for example, combining coding rate and quality. Thus, coding mode switch 12 is configured to assign the various portions 30a to 30c of the audio signal 24 to any of the two coding modes. For each portion 30a to 30c, mode switch 12 may be free to choose among both coding modes independent from the assignment of preceding portions which have previously been subject to the assignment. Mode switch 12 forwards portions to which the CELP coding mode has been assigned to CELP encoder 14, and portions to which the TCX coding mode has been assigned to TCX encoder 16. However, it should be noted that the assignment performed by mode switch 12 might be the result of a cooperation between the encoders 14 and 16 and mode switch 12. For example, encoders 14 and 16 may perform trials on each portion 30a to 30c so that these trials may be evaluated by mode switch 12 in order to decide on the coding mode to be eventually used. Further, it should be noted that a transition from one coding mode to the other between one portion and the immediately following portion may result in the mode switch 12 forwarding both portions to both encoders 14 and 16, or sub-parts thereof, in order to allow for special aliasing cancellation measures to be performed.
The CELP encoder 14 is configured to encode the portions to which the CELP coding mode has been assigned to obtain CELP frames. CELP encoder 14 forwards the information underlying the CELP frames to the multiplexer 20 which, in turn, inserts same into the data stream output at output 26. Similarly, TCX encoder 16 is configured to encode the portions to which the TCX mode has been assigned in order to obtain TCX frames with forwarding the information underlying same to multiplexer 20 for insertion into data stream 32. Both encoders 14 and 16 are configured such that each frame 34a, 34b and 34c of the data stream 32 comprises a mode identifier indicating the mode of the respective frame. Accordingly, the resulting data stream 32 at output 26 comprises one frame 34a, 34b and 34c per portion 30a to 30c of the audio signal 24. As illustrated in Fig. 1, the frame length of frames 34a to 34c measured, for example, in bits, does not need to be equal to each other. Rather, the frames 34a to 34c may vary in length. As both encoders 14 and 16 are of the linear prediction type, a linear prediction analysis is continuously performed on the consecutive portions 30a to 30c of the audio signal 24. As mentioned above, an LP analyzer 28 co-owned by both encoders 14 and 16 may be responsible for the LP analysis. The LP analyzer 28 may be configured to analyze the audio content within a current portion in order to determine linear prediction filter coefficients. By this measure, LP analyzer 28 may generate linear prediction filter coefficients for each of the portions 30a to 30c. The linear prediction filter coefficients are then used by encoders 14 and 16 in order to perform the respective encoding as will be outlined in more detail below.
The LP analyzer 28 may operate on a pre-emphasized version of the original content, and the respective pre-emphasis filter may be a high pass and, stated more specifically, an n-th order high pass filter, for example, such as H(z) = 1 − αz⁻¹ with α being set, for example, to 0.68. LP analyzer 28 may be configured to determine the linear prediction coefficients for the incoming portions 30a to 30c by use of, for example, an auto-correlation or covariance method. For example, using the auto-correlation method, LP analyzer 28 may compute an auto-correlation matrix and obtain the LPC coefficients therefrom using a Levinson-Durbin algorithm. The LPC coefficients define a synthesis filter which roughly models the human vocal tract and which, when driven by an excitation signal, essentially models the flow of air through the vocal cords. This synthesis filter is modeled using linear prediction by LP analyzer 28. The rate at which the shape of the vocal tract changes is limited. Accordingly, the LP analyzer 28 may use an update rate for the LPC coefficients adapted to this limitation and differing from the frame rate of frames 30a to 30c. In order to transmit the LPC coefficients, supporting side information on the LPC coefficients used may be transmitted to the decoding side via multiplexer 20 at a rate lower than the update rate. For example, the transmission rate may equal the portion rate of portions 30a to 30c. In particular, the update rate may be greater than the portion rate, and the transmission rate for LPC side information may lie between the update rate (inclusively) and the portion rate (also inclusively). A granularity or update rate greater than the frame/portion rate is achievable by, for example, interpolating between the LPC coefficients transmitted in the data stream, for example, per frame/portion. For example, each portion could be sub-divided into 4 subframes so that same would be of length 64 samples in case of 256-sample portions.
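The pre-emphasis and the auto-correlation based LPC estimation just described can be sketched as follows. The value α = 0.68 is taken from the text; the filter order, function names and test signal are illustrative, not those of the actual codec:

```python
import numpy as np

def pre_emphasis(x, alpha=0.68):
    # First-order high-pass H(z) = 1 - alpha * z^-1, alpha = 0.68 as in the text
    return np.append(x[0], x[1:] - alpha * x[:-1])

def levinson_durbin(r, order):
    # Solve the auto-correlation normal equations for the LPC analysis
    # filter A(z) = 1 + a[1]z^-1 + ... + a[order]z^-order.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                 # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)           # residual prediction error energy
    return a, err
```

For an AR(1)-like autocorrelation sequence r = (1, 0.9, 0.81), the recursion recovers the predictor a[1] = −0.9 with a vanishing second coefficient, as expected.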
Thus, temporal interpolation between the supporting LPC coefficient information may be used at the encoding and decoding side in order to fill the gaps between the supporting times. By this measure, both encoder and decoder have access to the same quantized LPC coefficients.
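The interpolation between supporting times could, for instance, look as follows. Practical codecs interpolate in an LSF/LSP-like domain rather than directly on the filter coefficients, so this is only an illustrative sketch with hypothetical names:

```python
import numpy as np

def interpolated_lpc(prev_supp, curr_supp, num_subframes=4):
    # Linearly blend two transmitted (quantized) LPC supporting sets to get
    # one coefficient set per subframe; encoder and decoder apply the same
    # rule, so both sides use identical quantized coefficients.
    sets = []
    for k in range(1, num_subframes + 1):
        w = k / num_subframes
        sets.append((1.0 - w) * prev_supp + w * curr_supp)
    return sets
```

With 4 subframes per 256-sample portion, as mentioned above, this yields one coefficient set per 64 samples.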
The LP analysis performed by LP analyzer 28 thus provides information on, or defines, certain filters, such as the linear prediction synthesis filter H(z), the inverse filter thereof, namely the linear prediction analysis filter or whitening filter A(z) with H(z) = 1/A(z), and, optionally, a perceptual weighting filter such as W(z) = A(z/λ), wherein λ is a weighting factor. Thus, LP analyzer 28 transmits information on the LPC coefficients to multiplexer 20 for insertion into the data stream 32. This information may represent the quantized linear prediction coefficients in an appropriate domain, such as a line spectral pair domain or the like. Even the quantization of the linear prediction coefficients may be performed in this domain. As already mentioned above, LP analyzer 28 may determine the LPC coefficients at an update rate greater than the rate at which the LPC coefficients are actually transmitted and reconstructed at the decoding side. The latter update rate may be achieved, for example, by interpolation between the LPC transmission supporting times and may even be higher than the portion rate, while the LPC transmission supporting times may occur at the portion rate. Obviously, the decoding side only has access to the quantized LPC coefficients, and accordingly, the afore-mentioned filters defined by the corresponding reconstructed linear predictions are denoted by Ĥ(z), Â(z) and Ŵ(z). As already outlined above, the LP analyzer 28 defines an LP synthesis filter H(z) and Ĥ(z), respectively, which, when applied to a respective excitation, recovers or reconstructs the original audio content apart from some post-processing which, however, is not considered here for ease of explanation. CELP encoder 14 and TCX encoder 16 serve to define, or determine an approximation of, this excitation and transmit respective information thereon to the decoding side via multiplexer 20 and data stream 32, respectively.
As far as the TCX encoder 16 is concerned, same may be configured to generate a spectral representation of the current TCX portion by use of a window-based time-to-spectral transformation, such as an MDCT, to weight the spectral representation in accordance with the linear prediction filter coefficients for the current portion, and to code the weighted spectral representation into the respective frame of the data stream 32 associated with the current portion. To be more precise, TCX encoder 16 may subject the incoming signal 24 at the current portion to which the TCX mode has been assigned, or a pre-emphasized version thereof (pre-emphasized by use of the above-mentioned pre-emphasis filter, for example), to an MDCT transformation using, for example, some overlap with the preceding and/or succeeding portions. In particular, the window 50 used for windowing and transforming the current portion (e.g. 30b) to the spectral domain in TCX encoder 16 may overlap with the succeeding portion (e.g. 30c) and/or preceding portion (e.g. 30a). The window function 50 used for windowing before the actual transformation may comprise zero portions 52₁,₂ at a beginning and an end thereof, and aliasing cancellation portions 54₁,₂ at the leading and trailing edge of the current portion so as to coincide with the aliasing cancellation portion of a preceding or succeeding TCX portion (e.g. 30a, 30c). The window function 50 may be defined so as not to include the zero portions 52₁,₂. However, an alternative interpretation is also possible.
The spectral coefficients of the resulting spectral representation, i.e. a transform, such as a DCT, of the whole window 50 defining the transform length 56, may then be subject to a spectral weighting using the LPC coefficients as received from LP analyzer 28. The LPC coefficients are converted into spectral weighting coefficients such that the resulting spectral formation corresponds to the analysis filter transfer function or the perceptually weighted analysis filter transfer function, the perceptual weighting being performed by the aforementioned perceptual weighting filter, for example. The weighted spectral representation thus obtained is then quantized and coded by TCX encoder 16 using, for example, a spectrally uniform quantization step size, thereby (perceptually) forming the quantization noise. Accordingly, TCX encoder 16 causes a minor delay due to the overlap 54₁,₂ of the window function 50 with preceding and succeeding portions, but this delay may be reduced by use of low-delay window functions, the non-zero portions of which overlap with the preceding/succeeding portions at merely a fraction of the TCX portion length. The fraction may, for example, be equal to or smaller than one fourth of the length of the portion 30b. That is, the non-zero portion of the window used may extend into the preceding and/or succeeding portions of the current portion merely over a length shorter than, or equal to, one fourth of the TCX portion length of the current portion. Alternatively, however, a 50 % overlap between the window functions may also be used.
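The conversion of LPC coefficients into spectral weights can be sketched as below: the analysis filter A(z) is evaluated on the unit circle at the bin frequencies and the resulting magnitudes are applied to the transform coefficients, so that uniform quantization of the weighted spectrum perceptually forms the noise. The exact mapping used by the codec is more refined; this is an assumption-laden illustration:

```python
import numpy as np

def lpc_to_weights(a, n_bins):
    # |A(e^{j*omega})| sampled at n_bins frequencies via a zero-padded FFT
    spectrum = np.fft.rfft(a, 2 * n_bins)
    return np.abs(spectrum[:n_bins])

def weight_spectral_representation(coeffs, a):
    # Shape the transform coefficients according to the analysis filter
    # transfer function before spectrally uniform quantization.
    return coeffs * lpc_to_weights(a, len(coeffs))
```

For the trivial predictor A(z) = 1 the weights are all unity and the spectrum passes through unchanged.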
In contrast to TCX encoder 16, CELP encoder 14 is configured to encode the current excitation of the current portion, to which the CELP coding mode is assigned, by using codebook indices. In particular, CELP encoder 14 may be configured to approximate the current excitation by a combination of an adaptive codebook excitation and an innovation codebook excitation, and to transmit the codebook indices yielding this approximation to the decoding side via multiplexer 20. CELP encoder 14 is configured to construct the adaptive codebook excitation for a current frame so as to be defined by a past excitation, i.e. the excitation used for a previously encoded CELP portion, for example, and an adaptive codebook parameter for the current CELP portion which somehow modifies the past excitation in order to yield the current adaptive codebook excitation. The adaptive codebook parameter may define a pitch lag and gain prescribing how to modify the past excitation. The CELP encoder 14 encodes the adaptive codebook parameter into the bitstream 32 by forwarding same to multiplexer 20. Further, CELP encoder 14 may construct the innovation codebook excitation defined by an innovation codebook index for the current portion and encode the innovation codebook index into the data stream 32 by forwarding same to multiplexer 20 for insertion into the data stream 32 and the respective frame 34a to 34c, respectively. In particular, CELP encoder 14 may be configured to determine the innovation codebook index along with a respective innovation codebook gain, and forward same for insertion into the data stream. In fact, both the adaptive codebook parameter and the innovation codebook index, and/or both gain values, may be integrated into one common syntax element and commonly coded into the respective frame of the data stream 32. Together, same enable the decoder to recover the approximation of the current excitation thus determined by CELP encoder 14.
In other words, the adaptive codebook may be defined in the data stream by pitch lag and gain, while the innovation codebook is signalled to the decoding side via information concerning a codebook index and a gain of the innovation codebook, wherein both gain values may be coded commonly. In order to guarantee the synchronization of the internal states of encoder and decoder, CELP encoder 14 not only determines the syntax elements enabling the decoder to recover the current codebook excitation, but also actually updates its state by actually generating same, in order to use the thus obtained current codebook excitation, i.e. the approximation of the actual current excitation, as a starting point, i.e. the past excitation, for encoding the next CELP portion.
To be more precise, the CELP encoder 14 may be configured to, in constructing the adaptive codebook excitation and the innovation codebook excitation, minimize a perceptually weighted distortion measure relative to the audio content of the current portion, considering that the resulting excitation is subject to LP synthesis filtering at the decoding side for reconstruction. In effect, a codebook index could index certain tables available at the encoding as well as the decoding side in order to index or otherwise determine vectors serving as an excitation input of the LP synthesis filter. Contrary to the adaptive codebook excitation, the innovation codebook excitation is determined independently of the past excitation. In effect, CELP encoder 14 may be configured to determine the adaptive codebook excitation for the current CELP portion using the past, reconstructed excitation of the previously coded CELP portion by modifying the latter using a certain delay and gain value and a predetermined (interpolation) filtering, so that the resulting adaptive codebook excitation of the current portion minimizes a difference to a certain target for the adaptive codebook excitation which, when filtered by the synthesis filter, recovers the original audio content. The just-mentioned delay, gain and filtering are indicated by the adaptive codebook parameters. The remaining discrepancy is compensated by the innovation codebook excitation. Again, CELP encoder 14 suitably sets the innovation codebook index to find an optimum innovation codebook excitation which, when combined with (such as added to) the adaptive codebook excitation of the current portion, yields the current excitation for the current portion, with the latter serving as the past excitation when constructing the adaptive codebook excitation of the following CELP portion. For further details, reference is made to the ACELP mode of the AMR-WB+ standard.
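A much simplified, integer-lag version of the adaptive codebook construction just described might look as follows. Actual ACELP uses fractional pitch lags with interpolation filters, which are omitted here; names are illustrative:

```python
import numpy as np

def adaptive_codebook_excitation(past_exc, pitch_lag, gain, subframe_len):
    # Repeat the past excitation delayed by pitch_lag and scale by the gain;
    # for lags shorter than the subframe the freshly built samples are
    # reused, as in the long-term predictor of CELP coders.
    buf = list(past_exc)
    out = []
    for _ in range(subframe_len):
        sample = buf[-pitch_lag]
        buf.append(sample)       # extend the buffer with the unscaled sample
        out.append(gain * sample)
    return np.array(out)
```

With a lag of 2 samples, the last two past-excitation samples are periodically repeated and scaled, which is exactly the pitch-periodic contribution the adaptive codebook models.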
As already mentioned above, encoder 10 may optionally comprise a bandwidth extension module. This bandwidth extension module 18 may be configured to generate bandwidth extension side information for the portions 30a to 30c and insert the respective bandwidth extension information into the data stream, frame-wise, via multiplexer 20. As mentioned above, bandwidth extension module 18 is optional and may, accordingly, not be present. Alternatively, the encoder 10 may be switchable so as to switch the operation of bandwidth extension module 18 on or off. If operative, bandwidth extension module 18 may operate as follows. First, bandwidth extension module 18 may operate on the original audio signal 24 and forward, for example, merely a band-limited portion thereof further on to mode switch 12. For example, bandwidth extension module 18 may operate on the audio signal 24 at the full sampling rate, whereas mode switch 12 merely receives the audio signal 24 at half the sampling rate, or at a sampling rate having another proper fractional ratio relative to the original sampling rate at which bandwidth extension module 18 performs the bandwidth extension coding. Bandwidth extension module 18 may, for example, perform a spectral analysis of the inbound audio signal 24 by use of, for example, an analysis filter bank. Using this analysis filter bank, the bandwidth extension module 18 may obtain a temporal/spectral sampling of the audio signal 24 at a spectral/temporal grid having a temporal resolution higher than the portion rate of portions 30a to 30c. See, for example, the illustrative dashed grid 70 in Fig. 1 as an example for the analysis filterbank grid. In order to obtain this spectral/temporal spectrogram 70 of the audio signal 24, bandwidth extension module 18 may use transform windows and an MDCT transform, a QMF filterbank as used in SBR according to HE-AAC, or a CLDFB (Complex Low Delay Filterbank) as used in Low Delay SBR according to AAC-ELD.
The bandwidth extension module 18 then analyzes the spectral envelope of the spectrogram within a high frequency portion 72 of the audio signal 24, i.e. the spectral components of the audio signal 24 not forwarded to mode switch 12. Bandwidth extension module 18 may determine the spectral envelope by determining the energy of the spectrogram within spectral/temporal tiles of a spectral/temporal grid which is coarser than the spectral/temporal grid provided by the afore-mentioned analysis filterbank, such as by summing up the squares of the spectral coefficients 76 within these tiles. Based on this spectral envelope, the bandwidth extension module 18 determines SBR data which is sent via multiplexer 20 to the decoding side. At the decoding side, the high frequency portion may be reconstructed based on the SBR data by appropriately replicating (or otherwise transposing) the low frequency portion 78 of the reconstructed audio signal, as obtained by decoding the CELP and TCX frames output by encoders 14 and 16, in order to obtain a finely varying high frequency pre-filling spectrum, and spectrally forming the latter in accordance with the spectral envelope defined by the SBR data. For further details regarding SBR, reference is made to the AAC-ELD standard.
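The envelope measurement over a coarse tile grid can be sketched as follows; the tile boundaries are illustrative, not those of any particular SBR frequency table:

```python
import numpy as np

def envelope_energies(spectrogram, time_edges, band_edges):
    # spectrogram: (time slots, frequency bins) from the analysis filterbank.
    # Sum squared magnitudes inside each coarse spectral/temporal tile to
    # obtain one envelope value per tile.
    env = np.zeros((len(time_edges) - 1, len(band_edges) - 1))
    for t in range(len(time_edges) - 1):
        for b in range(len(band_edges) - 1):
            tile = spectrogram[time_edges[t]:time_edges[t + 1],
                               band_edges[b]:band_edges[b + 1]]
            env[t, b] = np.sum(np.abs(tile) ** 2)
    return env
```

The decoder-side forming step then scales the replicated low-band spectrum so that its tile energies match these transmitted values.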
Alternatively, however, blind bandwidth extension as known from the AMR-WB standard may be used in order to extend the bandwidth reconstructible from the frames output by CELP encoder 14 and TCX encoder 16, respectively, to a higher frequency portion at the decoding side.
To summarize the above, each frame 34a to 34c may have the following information incorporated therein:

1) a mode identifier indicating whether the current frame is associated with a portion 30a to 30c encoded using the CELP mode or TCX mode, respectively;

2) LPC coefficient data pertaining to the associated portion 30a-c; as mentioned above, the LPC update rate may even be higher than the portion rate so that, for example, the LPC coefficients defined by the LPC coefficient data may change several times within the associated portion by way of interpolation in encoder and decoder;

3) bandwidth extension data such as SBR data, assisting the decoder in extending the bandwidth of the current frame compared to the bandwidth 78 as obtained from information items 4) and 5), respectively; in particular, the SBR data may cover, i.e. comprise information relating to the envelope of the high frequency portion 72 within, the temporal interval associated with the current portion 30a-30c;
4) in case of the current frame being a TCX frame, a coded representation of the weighted spectral representation (of the excitation or residual signal as obtained by applying the (perceptually weighted) analysis LPC filter transfer function onto the (pre-emphasized) audio signal); and
5) in case of the current frame being a CELP frame, a codebook index (such as the innovation codebook index) beside other data enabling the reconstruction of the current excitation signal based on the past excitation signal such as the adaptive codebook parameter and an energy or loudness related syntax element.
Accordingly, by restricting the decision regarding the main/core coding modes to the above outlined TCX and CELP coding mode, the encoder 10 is able to provide for a low coding delay at good coding efficiency even in case of audio signals of unspecified type, i.e. speech or non-speech. The low coding delay will also become apparent from the following description of a possible audio decoder.
Fig. 2 describes a unified speech and audio decoder 100 able to decode data streams generated by the encoder of Fig. 1 so as to reconstruct the original audio signal. The decoder 100 comprises a frame buffer 102, a CELP decoder 104 and a TCX decoder 106. Frame buffer 102 is connected between an input 108 of decoder 100 and respective inputs of decoders 104 and 106, respectively. Respective outputs of decoders 104 and 106 are connected to a respective input of a recombiner 110. Optionally, decoder 100 comprises a bandwidth extension module 112, with recombiner 110 being connected to output 114 of decoder 100 either directly or via the optional bandwidth extension module 112.
The mode of operation of decoder 100 is as follows. At input 108, the data stream 32 as generated by the encoder of Fig. 1 enters. As already mentioned above, the data stream 32 comprises consecutive frames 34a to 34c which, as illustrated in Figs. 1 and 2, may be arranged within the data stream 32 as self-contained or contiguous portions of the data stream, respectively. However, another arrangement within the data stream 32 would also be feasible.
In any case, the frame buffer 102 is responsible for buffering frames 34a to 34c for being operated on in modules 104, 106, 110 and 112, respectively. The frame buffer 102 is configured to buffer the data stream 32 in units of these frames 34a to 34c and to distribute the frames buffered to the CELP decoder 104 and the TCX decoder 106, respectively, under removal of the respective frames from the buffer, frame-wise. That is, the occupied storage space within frame buffer 102 increases and decreases in units of frames, respectively, and the available storage space may be configured to guarantee, for example, the accommodation of at least one frame. In other words, frame buffer 102 may be configured to buffer a sub-part of the data stream in units of the frames so that the buffered sub-part continuously comprises at least one frame, namely the one currently to be decoded. Of course, the frame buffer may have an available storage space accommodating more than one frame at a time.
As already noted above, each frame comprises a mode identifier assigning the respective frame 34a to 34c to a respective one of a plurality of coding modes comprising the CELP coding mode and the transform coded excitation LP coding mode.
The CELP decoder 104 is configured to decode the frames 34a to 34c to which the CELP coding mode is assigned, so as to reconstruct the respective portions 30a to 30c of the coded/reconstructible version 116 of the original audio signal 24. Likewise, TCX decoder 106 is configured to decode the frames 34a to 34c to which the TCX mode is assigned, so as to reconstruct the portions 30a to 30c of the reconstructed version 116, the coded version of which the respective frames represent. To be more precise, the frame buffer removes a frame currently to be decoded from its internal storage and distributes the information contained therein to the respective recipients. Needless to say, this demultiplexing function may be performed by an extra demultiplexer which could be positioned between frame buffer 102 on the one hand and modules 104, 106, 110 and 112 on the other hand. If, for example, the current frame is a TCX frame, frame buffer 102 forwards same to TCX decoder 106. At least, frame buffer 102 provides the TCX decoder 106 with the above-described weighted spectral representation of the excitation signal. Similarly, frame buffer 102 forwards CELP frames to CELP decoder 104. At least the codebook index is provided to CELP decoder 104. In case of bandwidth extension being applied, frame buffer 102 may forward the bandwidth extension data contained within the frames to bandwidth extension module 112. Similarly, frames positioned at transitions between the TCX coding mode and the CELP coding mode may comprise additional aliasing cancellation information, and frame buffer 102 may be configured to forward this additional information to recombiner 110. Lastly, the bitstream comprises the information on the linear prediction filter coefficients, and the frame buffer forwards this information to CELP decoder 104 and TCX decoder 106.
As both decoders 104 and 106 are of the linear prediction type and rely on the LPC filter coefficients, both decoders may share, or jointly own, a linear prediction coefficient decoder 118. As already described above, this linear prediction coefficient information decoder 118 obtains, for each frame 34a to 34c, the corresponding linear prediction filter coefficients. To this end, decoder 118 may obtain supporting linear prediction coefficients from the data stream 32 corresponding to supporting times, and interpolate the linear prediction coefficients to be used for the individual frames 34a to 34c by temporal interpolation.
Based on this linear prediction filter coefficient information, TCX decoder 106 and CELP decoder 104 decode the frames assigned to them.
In particular, TCX decoder 106 may be configured to decode a frame currently to be decoded and having the TCX mode assigned thereto in the following way. First, TCX decoder 106 may decode the weighted spectral representation from the current frame. This may, for example, include dequantization and re-scaling of the spectral coefficients of the weighted spectral representation. Then, TCX decoder 106 may perform the re-weighting of the weighted spectral representation using the linear prediction filter coefficients for the current frame as obtained by decoder 118. To this end, TCX decoder 106 may turn these linear prediction filter coefficients into spectral weighting factors together defining a spectral formation in accordance with a transfer function corresponding to an (optionally perceptually weighted) LPC synthesis filter defined by the linear prediction filter coefficients. TCX decoder 106 thus spectrally forms the weighted spectral representation as obtained from the data stream 32 in order to obtain a re-weighted spectral representation. The re-weighted spectral representation is then retransformed into the time domain by use of a window-based spectral-to-time transformation. Imagine, for example, that portion 30b is the current TCX frame. After performance of the window-based spectral-to-time transformation, TCX decoder 106 obtains a time-domain signal relating to a time portion of reconstructed signal 116 which overlaps portion 30b, which the currently decoded frame is associated with, but extends beyond that portion 30b into the subsequent portion 30c as well as the preceding portion 30a. This time portion 56 may comprise the above-described aliasing cancellation portions 54a and 54b at the borders between the current frame 30b and the portions 30a and 30c of immediately preceding and succeeding frames, respectively. Thus, in order to complete the window-based spectral-to-time transformation for the current frame 30b, recombiner 110 recombines, i.e.
overlaps and adds, within aliasing cancellation portions 54a and 54b the time-domain signals obtained by the window based spectral-to-time transformation for the consecutive TCX frames in order to yield the actual time-domain reconstructed version of these portions.
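The recombination within the aliasing cancellation portions amounts to an overlap-and-add of consecutive transform outputs, roughly as follows:

```python
import numpy as np

def overlap_add(prev_output, curr_output, overlap_len):
    # Sum the tail of the previous windowed transform output with the head
    # of the current one; within this region the time-domain aliasing of
    # the two MDCT frames cancels (TDAC), yielding the reconstructed samples.
    result = curr_output.astype(float).copy()
    result[:overlap_len] += prev_output[-overlap_len:]
    return result
```

A shorter overlap, as produced by the low-delay windows mentioned above, directly shortens this summation region and hence the decoder-side delay.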
CELP decoder 104 is configured to use a codebook index comprised within a current CELP frame in order to build the excitation signal for the current frame, and to apply a synthesis filter depending on the linear prediction filter coefficients for the current frame to the excitation signal so as to obtain the time-domain signal of the current CELP frame. As described above, the CELP decoder 104 may use ACELP, and in this case CELP decoder 104 may retrieve an innovation codebook index along with an adaptive codebook parameter from the current frame. CELP decoder 104 uses these in order to reconstruct the adaptive codebook excitation and the innovation codebook excitation, respectively. For example, using the adaptive codebook parameter, CELP decoder 104 may construct the adaptive codebook excitation by modifying/interpolating the past reconstructed excitation depending on the adaptive codebook parameter. The CELP decoder 104 may combine this adaptive codebook excitation with an innovation codebook excitation in order to yield the reconstructed version of the current excitation. In order to obtain the innovation codebook excitation, CELP decoder 104 evaluates the innovation codebook index. Both adaptive codebook excitation and innovation codebook excitation are combined with each other by way of a weighted sum, with the weighting factors also being determined by CELP decoder 104 via the adaptive codebook parameter and the innovation codebook index. As already noted above, the reconstructed excitation of the current frame forms the basis for determining the adaptive codebook excitation of the following CELP frame. Recombiner 110 puts together the reconstructed versions of consecutive CELP frames as output by CELP decoder 104. As already briefly noted above, recombiner 110 may be configured to perform special measures at transitions between the TCX coding mode and the CELP coding mode, respectively.
In particular, recombiner 110 may evaluate additional information contained in the data stream 32 to this end. At the output of recombiner 110, the reconstructed version 116 of the original audio signal 24 results.
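The synthesis filtering performed by CELP decoder 104 on the reconstructed excitation, described above, is an all-pole recursion; a direct-form sketch:

```python
import numpy as np

def lp_synthesis(excitation, a):
    # All-pole synthesis filter H(z) = 1/A(z) with
    # A(z) = 1 + a[1]z^-1 + ... , i.e. out[n] = exc[n] - sum_k a[k]*out[n-k].
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out[n] = acc
    return out
```

Driving the filter A(z) = 1 − 0.5z⁻¹ with a unit impulse yields the expected geometric impulse response 1, 0.5, 0.25, ...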
The optional bandwidth extension module 112 may extend the bandwidth of the reconstructed signal 116 as obtained by recombiner 110 into, for example, the higher frequency portion (see 72 in Fig. 1). For example, in case of SBR, bandwidth extension module 112 may apply a spectral analysis to signal 116 by use of, for example, an analysis filterbank such as a QMF or CLDFB filterbank so as to obtain a spectrogram thereof in a spectral/temporal resolution within the lower frequency region 78, the temporal component of which exceeds the portion rate of portions 30a to 30c. Bandwidth extension module 112 uses this spectrogram in order to pre-fill, such as by replication, the high frequency portion 72, and then spectrally forms the pre-filled version using the SBR data forwarded by frame buffer 102 for the individual frames 34a to 34c in the grid resolution 74. Using a synthesis filterbank such as a QMF or CLDFB filterbank, the bandwidth extension module 112 may then retransfer the spectrally expanded spectrogram extending over both frequency portions 78 and 72 to the time domain in order to yield the reconstruction of the audio signal.
As becomes clear from the above discussion, the audio codec described above with respect to Figs. 1 and 2, and in accordance with embodiments of the present invention, is able to provide for a high coding efficiency even when dealing with audio signals of differing type, such as speech and non-speech signals. Beyond that, the coding delay is low.
For example, the delay of the above-described embodiments may be so low that same are suitable for a two-way communication. Despite the delay restrictions, high music quality is obtainable, as well as a speech quality comparable to that of specially dedicated speech codecs.
In order to give specific examples, the portions 30a to 30c mentioned above may have a length of 256 samples each. At a sampling rate of 12.8 kHz this results in a frame/portion length of 20 ms. If using SBR as the bandwidth extension, the original audio signal may have a sampling rate, or the bandwidth extension module 18 may operate on a sampling rate, of double the sampling rate underlying the CELP and TCX coding, i.e. 25.6 kHz. Of course, the ratio of 2:1 merely serves as an example and other ratios are feasible as well, such as 2.5:1, leading to a sampling rate of 32 kHz on which the bandwidth extension module operates. Additionally, sampling rates other than 12.8 kHz are also feasible in connection with the CELP and TCX coding modes. However, in case of the just-mentioned 20 ms framing using SBR with an internal sampling rate of 12.8 kHz and an external sampling rate of 25.6 kHz, the resulting delay of the above outlined embodiments might be 45 ms in total. 20 ms stem from the framing structure of the frames 34a to 34c itself. Another 20 ms may stem from the overlap of the window functions of the TCX coding mode. That is, the transformation length 56 may be 40 ms or 512 samples, respectively. 2.5 ms may result from the window functions involved in the analysis and synthesis filterbanks involved in the bandwidth extension performing the SBR. Finally, another 2.5 ms may result from additional filtering and resampling measures not described in detail above.
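The 45 ms figure is simply the sum of the contributions quoted above; for clarity:

```python
def total_delay_ms(framing=20.0, tcx_overlap=20.0,
                   sbr_filterbanks=2.5, filtering_resampling=2.5):
    # Delay budget for the SBR-based configuration described in the text;
    # the defaults are the individual contributions quoted above.
    return framing + tcx_overlap + sbr_filterbanks + filtering_resampling
```

With the defaults this yields 45 ms; the 43 ms figures of the WB and NB modes discussed below correspond to dropping the SBR filterbank term and accounting 3 ms for filtering and resampling instead.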
The 20 ms resulting from the overlap between overlapping TCX window functions may be reduced down to, for example, 5-7 ms by use of low-overlap or low-delay windows wherein the actual non-zero portion of the window 50 is smaller than the extension of the transformation length 56.
By this measure, a kind of super-wideband (SWB) mode would be obtainable.
However, if the spectral extension obtained by the bandwidth extension module is not so critical, i.e. the spectral extension may be lower, a wideband (WB) mode may be used according to which the 20 ms framing structure combined with the 12.8 kHz internal sampling rate is used, with, however, blind bandwidth extension known from, for example, AMR-WB being used instead of SBR to extend the bandwidth from, for example, 6.4 kHz (cf. 78 in Fig. 1) to 7 kHz (cf. 78 and 72 in Fig. 1). The resulting delay may then be reduced to 43 ms in total. Again, 20 ms stem from the framing structure itself and another 20 ms from the overlap between consecutive TCX windows, which time delay may, as just mentioned, be reduced down to 5 or 7 ms. Lastly, another 3 ms stem from filtering and resampling.
Finally, a narrow band (NB) mode may be obtained by omitting any bandwidth extension. In this case, the 20 ms framing structure combined with the 12.8 kHz internal sampling rate may be used. A resampling from 8 kHz to 12.8 kHz may be used in order to use the same coding kernel as for the WB mode. In this case, the resulting delay is again 43 ms in total, namely the 20 ms stemming from the framing structure, another 20 ms stemming from the overlap between consecutive TCX windows, which delay may, as mentioned above, be reduced down to 5 or 7 ms by use of low-overlap or low-delay windows, and another 3 ms due to filtering and resampling.

Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus. The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier. In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer. A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.
The above described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

Claims

1. Unified speech and audio decoder comprising a frame buffer (102) configured to buffer a subpart of a data stream (32) composed of consecutive frames (34a, 34b, 34c) in units of the frames so that the subpart continuously comprises at least one frame, each frame representing a coded version of a respective portion of consecutive portions (30a, 30b, 30c) of an audio signal (32), and each frame comprising a mode identifier assigning the respective frame (34a, 34b, 34c) to a respective one of a plurality of coding modes comprising a CELP coding mode and a transform coded excitation LP coding mode; a CELP decoder (104) configured to decode the frames to which the CELP coding mode is assigned, to reconstruct the respective portions of the audio signal; a transform coded excitation LP decoder (106) configured to decode the frames to which the transform coded excitation LP coding mode is assigned, to reconstruct the respective portions of the audio signal, wherein the frame buffer (102) is configured to distribute the buffered frames to the CELP decoder (104) and the transform coded excitation LP decoder (106) under removal of the respective frames from the frame buffer, frame-wise.
2. Unified speech and audio decoder according to claim 1, wherein the portions (30a, 30b, 30c) are coded using bandwidth extension with each frame including respective bandwidth extension information, and the unified speech and audio decoder further comprises a bandwidth extension module (112) configured to perform bandwidth extension on the reconstructed portions, portion-wise.
3. Unified speech and audio decoder according to claim 1 or 2, wherein the data stream (32) comprises information on LP filter coefficients for each of the frames (34a, 34b, 34c), and the CELP decoder (104) is configured to decode the frames to which the CELP coding mode is assigned, by building an excitation signal for the respective frame using a codebook index comprised within the respective frame and applying a synthesis filter depending on the linear prediction filter coefficients for the respective frame onto the excitation signal, and the transform coded excitation LP decoder (106) is configured to decode each frame to which the transform coded excitation LP coding mode is assigned, by decoding a weighted spectral representation from the respective frame, re-weighting the weighted spectral representation in accordance with the linear prediction filter coefficients for the respective frame, and retransforming the re-weighted spectral representation by use of a window based spectral-to-time transformation.
4. Unified speech and audio encoder comprising a mode switch (12) configured to assign to each of the consecutive portions (30a, 30b, 30c) of an audio signal (32) a respective one of a plurality of coding modes merely consisting of a CELP coding mode and a transform coded excitation LP coding mode; a CELP encoder (14) configured to encode the portions to which the CELP coding mode is assigned, to obtain CELP frames; and a transform coded excitation LP encoder (16) configured to encode the portions to which the transform coded excitation LP coding mode is assigned, to obtain transform coded frames, wherein the unified speech and audio encoder is configured such that each CELP frame has a coding mode identifier identifying the CELP coding mode, and each transform coded frame has an identifier identifying the transform coded excitation LP coding mode.
5. Unified speech and audio encoder according to claim 4, further comprising a bandwidth extension module (18) configured to generate bandwidth extension information for the portions (30a, 30b, 30c) and to insert the respective bandwidth extension information into the data stream (32), frame-wise.
6. Unified speech and audio encoder according to claim 4 or 5, wherein the CELP encoder (14) and the transform coded excitation LP encoder (16) comprise an LP analyzer (28) configured to generate LP filter coefficients for each of the portions (30a, 30b, 30c) and to encode information on the LP filter coefficients into the data stream (32), wherein the CELP encoder (14) is configured to apply an analysis filter based on the LP filter coefficients to the portions (30a, 30b, 30c) to which the CELP coding mode is assigned, in order to yield an excitation signal, to approximate the excitation signal using a codebook index and to insert the codebook index into the respective frame of the data stream (32), and the transform coded excitation LP encoder (16) is configured to generate a spectral representation of the portions (30a, 30b, 30c) to which the transform coded excitation LP coding mode is assigned, by use of a window based time-to-spectral transformation, to weight the spectral representation in accordance with the LP filter coefficients, and to code the weighted spectral representation into the respective frame.
7. Unified speech and audio decoding method, comprising buffering a subpart of a data stream (32) composed of consecutive frames (34a, 34b, 34c) in units of the frames in a frame buffer (102) so that the subpart continuously comprises at least one frame, each frame representing a coded version of a respective portion of consecutive portions (30a, 30b, 30c) of an audio signal (32), and each frame comprising a mode identifier assigning the respective frame (34a, 34b, 34c) to a respective one of a plurality of coding modes comprising a CELP coding mode and a transform coded excitation LP coding mode; decoding, in a CELP decoder (104), the frames to which the CELP coding mode is assigned, to reconstruct the respective portions of the audio signal; decoding, in a transform coded excitation LP decoder (106), the frames to which the transform coded excitation LP coding mode is assigned, to reconstruct the respective portions of the audio signal; and distributing the buffered frames to the CELP decoder (104) and the transform coded excitation LP decoder (106) under removal of the respective frames from the frame buffer, frame-wise.
8. Unified speech and audio encoding method comprising assigning to each of consecutive portions (30a, 30b, 30c) of an audio signal (32) a respective one of a plurality of coding modes merely consisting of a CELP coding mode and a transform coded excitation LP coding mode; encoding, in a CELP encoder (14), the portions to which the CELP coding mode is assigned, to obtain CELP frames; and encoding, in a transform coded excitation LP encoder (16), the portions to which the transform coded excitation LP coding mode is assigned, to obtain transform coded frames, wherein each CELP frame has a coding mode identifier identifying the CELP coding mode, and each transform coded frame has an identifier identifying the transform coded excitation LP coding mode.
9. Data stream composed of consecutive frames (34a, 34b, 34c), each of which has assigned a respective one of a plurality of coding modes merely consisting of a CELP coding mode and a transform coded excitation LP coding mode, wherein each CELP frame has a coding mode identifier identifying the CELP coding mode, and each transform coded frame has an identifier identifying the transform coded excitation LP coding mode.
10. Computer readable digital storage medium having stored thereon a computer program having a program code for performing, when running on a computer, a method according to claim 7 or 8.
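The frame-wise dispatch recited in claims 1 and 7 can be illustrated by a minimal sketch (the dictionary layout, function names and toy decoders are assumptions chosen for illustration, not structures defined in the claims):

```python
from collections import deque

CELP, TCX = 0, 1  # mode identifiers carried by each frame (values assumed)

def decode_stream(frames, celp_decode, tcx_decode):
    """Buffer the frames, then hand each one to the decoder selected by its
    mode identifier, removing it from the frame buffer frame-wise."""
    frame_buffer = deque(frames)
    portions = []
    while frame_buffer:
        frame = frame_buffer.popleft()  # removal from the buffer, frame-wise
        decode = celp_decode if frame["mode"] == CELP else tcx_decode
        portions.append(decode(frame))  # reconstructed portion of the signal
    return portions

# Toy decoders standing in for the CELP and transform coded excitation LP kernels:
out = decode_stream(
    [{"mode": CELP, "data": "a"}, {"mode": TCX, "data": "b"}],
    lambda f: ("celp", f["data"]),
    lambda f: ("tcx", f["data"]),
)
print(out)  # [('celp', 'a'), ('tcx', 'b')]
```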
PCT/EP2011/058701 2010-05-28 2011-05-27 Low-delay unified speech and audio codec WO2011147950A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US34935610P 2010-05-28 2010-05-28
US61/349,356 2010-05-28

Publications (1)

Publication Number Publication Date
WO2011147950A1 true WO2011147950A1 (en) 2011-12-01

Family

ID=44351512

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/058701 WO2011147950A1 (en) 2010-05-28 2011-05-27 Low-delay unified speech and audio codec

Country Status (3)

Country Link
AR (1) AR081264A1 (en)
TW (1) TW201214415A (en)
WO (1) WO2011147950A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010040522A2 (en) * 2008-10-08 2010-04-15 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. Multi-resolution switched audio encoding/decoding scheme
WO2011048094A1 (en) * 2009-10-20 2011-04-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Multi-mode audio codec and celp coding adapted therefore


Non-Patent Citations (1)

Title
BESSETTE B ET AL: "A wideband speech and audio codec at 16/24/32 kbit/s using hybrid ACELP/TCX techniques", SPEECH CODING PROCEEDINGS, 1999 IEEE WORKSHOP ON PORVOO, FINLAND 20-23 JUNE 1999, PISCATAWAY, NJ, USA,IEEE, US, 20 June 1999 (1999-06-20), pages 7 - 9, XP010345581, ISBN: 978-0-7803-5651-1, DOI: 10.1109/SCFT.1999.781466 *

Cited By (28)

Publication number Priority date Publication date Assignee Title
WO2012110480A1 (en) 2011-02-14 2012-08-23 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio codec supporting time-domain and frequency-domain coding modes
WO2012110476A1 (en) 2011-02-14 2012-08-23 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Linear prediction based coding scheme using spectral domain noise shaping
RU2547241C1 (en) * 2011-02-14 2015-04-10 Фраунхофер-Гезелльшафт Цур Фердерунг Дер Ангевандтен Форшунг Е.Ф. Audio codec supporting time-domain and frequency-domain coding modes
US9037457B2 (en) 2011-02-14 2015-05-19 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Audio codec supporting time-domain and frequency-domain coding modes
US9047859B2 (en) 2011-02-14 2015-06-02 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for encoding and decoding an audio signal using an aligned look-ahead portion
US9153236B2 (en) 2011-02-14 2015-10-06 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Audio codec using noise synthesis during inactive phases
US9384739B2 (en) 2011-02-14 2016-07-05 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for error concealment in low-delay unified speech and audio coding
US9536530B2 (en) 2011-02-14 2017-01-03 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Information signal representation using lapped transform
US9583110B2 (en) 2011-02-14 2017-02-28 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for processing a decoded audio signal in a spectral domain
US9595262B2 (en) 2011-02-14 2017-03-14 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Linear prediction based coding scheme using spectral domain noise shaping
US9595263B2 (en) 2011-02-14 2017-03-14 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Encoding and decoding of pulse positions of tracks of an audio signal
US9620129B2 (en) 2011-02-14 2017-04-11 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for coding a portion of an audio signal using a transient detection and a quality result
US11146903B2 (en) 2013-05-29 2021-10-12 Qualcomm Incorporated Compression of decomposed representations of a sound field
US11962990B2 (en) 2013-05-29 2024-04-16 Qualcomm Incorporated Reordering of foreground audio objects in the ambisonics domain
US10499176B2 (en) 2013-05-29 2019-12-03 Qualcomm Incorporated Identifying codebooks to use when coding spatial components of a sound field
RU2654139C2 (en) * 2013-07-22 2018-05-16 Фраунхофер-Гезелльшафт Цур Фердерунг Дер Ангевандтен Форшунг Е.Ф. Frequency-domain audio coding supporting transform length switching
US10242682B2 (en) 2013-07-22 2019-03-26 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Frequency-domain audio coding supporting transform length switching
US10984809B2 (en) 2013-07-22 2021-04-20 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Frequency-domain audio coding supporting transform length switching
US11862182B2 (en) 2013-07-22 2024-01-02 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Frequency-domain audio coding supporting transform length switching
RU2688275C2 (en) * 2014-05-16 2019-05-21 Квэлкомм Инкорпорейтед Selection of codebooks for encoding vectors decomposed from higher-order ambisonic audio signals
US10770087B2 (en) 2014-05-16 2020-09-08 Qualcomm Incorporated Selecting codebooks for coding vectors decomposed from higher-order ambisonic audio signals
US10431230B2 (en) 2015-06-16 2019-10-01 Fraunhofer-Gesellschaft Zur Foerderung De Angewandten Forschung E.V. Downscaled decoding
US11341980B2 (en) 2015-06-16 2022-05-24 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Downscaled decoding
US11341978B2 (en) 2015-06-16 2022-05-24 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Downscaled decoding
US11341979B2 (en) 2015-06-16 2022-05-24 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Downscaled decoding
US11670312B2 (en) 2015-06-16 2023-06-06 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Downscaled decoding
US11062719B2 (en) 2015-06-16 2021-07-13 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Downscaled decoding
RU2683487C1 (en) * 2015-06-16 2019-03-28 Фраунхофер-Гезелльшафт Цур Фердерунг Дер Ангевандтен Форшунг Е.Ф. Shortened decoding

Also Published As

Publication number Publication date
TW201214415A (en) 2012-04-01
AR081264A1 (en) 2012-07-18

Similar Documents

Publication Publication Date Title
WO2011147950A1 (en) Low-delay unified speech and audio codec
US11741973B2 (en) Audio encoder for encoding a multichannel signal and audio decoder for decoding an encoded audio signal
Neuendorf et al. Unified speech and audio coding scheme for high quality at low bitrates
KR101227729B1 (en) Audio encoder and decoder for encoding frames of sampled audio signals
AU2009267518B2 (en) Apparatus and method for encoding/decoding an audio signal using an aliasing switch scheme
AU2010309894B2 (en) Multi-mode audio codec and CELP coding adapted therefore
US9043215B2 (en) Multi-resolution switched audio encoding/decoding scheme
CA2730232C (en) An apparatus and a method for decoding an encoded audio signal
RU2584463C2 (en) Low latency audio encoding, comprising alternating predictive coding and transform coding
CN102959620B (en) Information signal representation using lapped transform
WO2010003491A1 (en) Audio encoder and decoder for encoding and decoding frames of sampled audio signal
MX2011003824A (en) Multi-resolution switched audio encoding/decoding scheme.
CN103548078A (en) Audio codec supporting time-domain and frequency-domain coding modes
CN105723457B (en) Predictive coding/decoding is transitioned into from transition coding/decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11733810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11733810

Country of ref document: EP

Kind code of ref document: A1