WO2009136872A1 - Method and device for encoding an audio signal, method and device for generating encoded audio data and method and device for determining a bit-rate of an encoded audio signal - Google Patents


Info

Publication number
WO2009136872A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
residual
signal portion
bit
candidate
Prior art date
Application number
PCT/SG2009/000163
Other languages
French (fr)
Inventor
Te Li
Susanto Rahardja
Original Assignee
Agency For Science, Technology And Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency For Science, Technology And Research
Publication of WO2009136872A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture
    • G10L19/18Vocoders using multiple modes
    • G10L19/24Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding

Definitions

  • Method and device for encoding an audio signal, method and device for generating encoded audio data, and method and device for determining a bit-rate of an encoded audio signal
  • Embodiments of the invention generally relate to a method and device for encoding an audio signal, a method and a device for generating encoded audio data and a method and a device for determining a bit-rate of an encoded audio signal.
  • music stores may archive different versions of the same piece of music at different bitrates on their file servers. This is a burden for the file servers since it increases the complexity of the data management and the amount of necessary storage space.
  • a music store may prefer to encode songs at a required bit rate only when a purchase order is received for that bit rate. This, however, is both time-consuming and computationally intensive. Moreover, there may be customers who wish to upgrade a piece of music, e.g. a song that they have purchased, to a better quality, for example in case they want to listen to the song using a hi-fi system. In this case, the only option would be to purchase and download the entire song at a larger size, resulting in them having to keep different versions of the same song. It is therefore not pragmatic, for both an online music store and its customers, to employ fixed bit rate audio coding on an online music store offering multiple qualities for songs.
  • MPEG- 4 audio scalable lossless coding (SLS, cf. e.g. [1] and [2]) integrates the functions of lossless audio coding, perceptual audio coding and fine granular scalable audio coding in a single framework. It allows the scaling up of a perceptually coded representation such as a MPEG-4 AAC coded piece of audio to a lossless representation of the piece of audio with fine granular scalability.
  • a music manager system based on SLS coding technology has been designed for online music stores.
  • a server maintained by an online store is able to deliver songs to its clients at various bitrates and prices with single file archival for each piece of music.
  • the processing of the files may be performed very fast and the upgrading of the quality of a piece of music by a customer can be easily and efficiently achieved by offering a "top-up" to the original song without the need of keeping multiple copies for the same piece of music.
  • the multi-quality model is more desirable, but there is a lack of a clear link between scalable audio bit rate and perceptual quality.
  • Embodiments may be seen to be based on the problem of providing a method for encoding an audio signal that allows audio signals to be provided which meet a pre-defined quality requirement.
  • a method for encoding an audio signal including a core audio signal portion and a residual audio signal portion includes selecting, from the residual audio signal portion a first candidate residual audio signal portion and a second candidate residual audio signal portion; comparing the first candidate residual audio signal portion with the second candidate residual audio signal portion; and, depending on the result of the comparison, encoding the audio signal using the second candidate residual audio signal portion.
  • a method for generating encoded audio data includes including an indication into the audio data specifying at least one part of the encoded audio data which does not have to be decoded for the decoded audio data to meet a pre-determined quality level.
  • a method for determining a bit- rate of an encoded audio signal generated from an original audio signal includes determining, for a representation of the original audio signal as a core audio signal portion and a residual audio signal portion, a measure of the residual audio signal portion; and determining the bit-rate based on a comparison of the measure of the residual audio signal portion with a pre-defined residual threshold, wherein the pre-defined residual threshold is based on a predefined quality level.
  • FIG. 1 shows a frequency - sound pressure level diagram.
  • FIG. 2 shows a time - threshold increase diagram.
  • FIG. 3 shows a frequency - audio energy diagram.
  • FIGs. 4A and 4B show frequency - audio energy diagrams.
  • FIG. 5 shows a flow diagram according to an embodiment.
  • FIG. 6 shows an encoder according to an embodiment.
  • FIG. 7 shows an encoder according to an embodiment.
  • FIG. 8 shows a decoder according to an embodiment.
  • FIG. 9 shows a bit plane diagram.
  • FIG. 10 shows a hierarchy of audio files.
  • FIG. 11 shows a data flow diagram.
  • FIG. 12 shows a hierarchy of audio files.
  • FIG. 13 shows a hierarchy of audio files.
  • FIG. 14 shows a flow diagram according to an embodiment.
  • FIG. 15 shows a bit plane diagram.
  • an otherwise clearly audible sound can be masked by another sound.
  • conversation at a bus stop can become completely impossible while an incoming bus producing loud noise drives past.
  • This phenomenon is called Masking.
  • a weaker sound is masked if it is made inaudible in the presence of a louder sound. If two sounds occur simultaneously and one is masked by the other, this is referred to as simultaneous masking.
  • simultaneous masking is also sometimes called frequency masking. This is illustrated in Figure 1.
  • FIG. 1 shows a frequency-sound pressure level diagram 100.
  • Frequency values correspond to the values along a first axis (x-axis) 101 and sound pressure levels (in dB) correspond to values along a second axis (y-axis) 102.
  • a first line 103 illustrates a high intensity signal.
  • the high intensity signal behaves like a masker and is able to mask a relatively weak signal (illustrated by a second line 104) in a nearby frequency range.
  • the masking level is illustrated by a dashed line 105 while the audible level without masking is illustrated by a solid line 106.
  • FIG. 2 shows a time - threshold increase diagram 200.
  • Time values correspond to the values along a first axis (x-axis) 201 and the threshold increase (in dB) corresponds to values along a second axis (y-axis) 202.
  • a solid line 203 illustrates the audibility threshold increase that is caused by a masking signal illustrated by a block 204.
  • Masking may be applied in audio compression to determine which frequency components can be discarded or compressed more strongly (e.g. by coarser quantization).
  • perceptual audio coding is a method of encoding audio that uses psychoacoustic models to discard data corresponding to audio components which may not be perceived by humans.
  • Perceptual audio coding may also eliminate softer sounds that are drowned out by louder sounds, i.e., it may take advantage of masking.
  • an audio signal is first decomposed into several critical bands using filter banks. Average amplitudes are calculated for each frequency band and are used to obtain corresponding hearing thresholds.
  • FIG. 3 shows a frequency - audio energy diagram 300.
  • Frequencies correspond to the values along a first axis (x-axis) 301 and the audio energy levels (in dB) correspond to values along a second axis (y-axis) 302.
  • a first line 303 indicates the audio energy per scale factor band for an exemplary audio file for one frame of the audio file (wherein a frame includes a plurality of sample values of an audio signal).
  • a second line 304 indicates the masking level that is caused by the sounds of the audio file.
  • Signals with energies below the modified threshold may be considered inaudible and may be discarded in perceptual coding. In this way, the total entropy of the audio signal can be reduced, opening up the possibility of higher compression rates.
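The discarding step can be sketched in a few lines of Python; the band energies and thresholds below are illustrative values, not data from the patent.

```python
def prune_inaudible(band_energies, thresholds):
    """Zero out frequency bands whose energy lies below the
    masking-modified hearing threshold; such components are
    assumed inaudible and need not be encoded."""
    return [e if e >= t else 0.0
            for e, t in zip(band_energies, thresholds)]

# Illustrative per-band energies and thresholds:
print(prune_inaudible([5.0, 0.5, 2.0], [1.0, 1.0, 3.0]))  # [5.0, 0.0, 0.0]
```

Only the surviving (non-zero) bands would then be quantized and entropy coded, which is what reduces the total entropy of the signal.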
  • perceptual audio coding is known as lossy audio compression because it "loses" pieces of sound (the removed ones that cannot be heard).
  • MP3 (MPEG-1 Layer 3) and AAC (MPEG-4 advanced audio coding) are perceptual coding methods.
  • Another type of compression is lossless compression, which encodes repetitive information into symbols to reduce audio file size. This allows a user to reconstruct an exact copy of the original audio signal.
  • With lossless compression, the rate of compression that can be achieved is not as high as the rate of compression achievable with lossy compression.
  • Examples of lossless codecs are FLAC (free lossless audio coding), Monkey's Audio and MPEG-4 ALS (audio lossless coding).
  • the noise (distortion) of an audio signal may be defined as the difference between the original audio signal and the compressed audio signal. It has been shown by various types of subjective tests that if the energy of the noise is controlled to be below the audio masking level (provided that the mask calculation is accurate enough), the noise is not perceptible by ("typical") human ears. Thus, the quality of the compressed audio signal will be "transparent" compared to the quality of the original audio signal.
  • the transparent quality can typically be achieved at around 128-192 kbps (stereo).
  • Example plots of the noise generated under MPEG-4 AAC at 64kbps and 128kbps are shown in Figures 4A and 4B.
  • FIGs. 4A and 4B show frequency - audio energy diagrams 401 and 402.
  • Frequencies in terms of scale factor bands correspond to the values along a respective first axis (x-axis) 401 and the audio energy levels (in dB) correspond to values along a respective second axis (y-axis) 402.
  • a respective first line 403 indicates the audio energy per scale factor band for an exemplary audio file for one frame of the audio file.
  • a respective second line 404 indicates the masking level that is caused by the sounds of the audio file and a respective third line 405 indicates the noise level in case of MPEG-4 AAC compression at 64kbps (FIG. 4A) and at 128kbps (FIG. 4B), respectively.
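The noise-versus-mask criterion illustrated in Figures 4A and 4B can be sketched as a simple per-band check; all spectral values and mask levels below are illustrative, not data from the patent.

```python
def band_noise_energy(original, compressed):
    """Per-band noise energy: the noise is defined as the difference
    between the original and the compressed spectral values."""
    return [(o - c) ** 2 for o, c in zip(original, compressed)]

def is_transparent(original, compressed, mask):
    """True if the noise energy stays below the masking level in every
    band, i.e. the compression is perceptually transparent."""
    noise = band_noise_energy(original, compressed)
    return all(n <= m for n, m in zip(noise, mask))

# Illustrative spectral values and mask levels:
orig = [10.0, 4.0, 2.0]
comp = [9.5, 4.0, 1.5]
mask = [1.0, 1.0, 1.0]
print(is_transparent(orig, comp, mask))  # True
```

In Figure 4A the third line crosses above the mask in some bands, which is exactly the condition under which this check would return False.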
  • a method allows a scalable audio coder to encode an audio signal using a minimum enhancing bit rate for the particular audio signal necessary to achieve transparent encoding quality of the audio signal.
  • the input for the method may for example be a low quality perceptual audio signal and the original (i.e. uncompressed) audio signal.
  • a psychoacoustic model which may be a conventional psychoacoustic model
  • the encoding based on the method provided is able to stop at an optimal point (e.g. at an optimal position in the bit plane scanning process) for which transparent quality can just be achieved.
  • This method can for example be used to satisfy multi-quality music requirements, such as they arise for an online music store.
  • FIG. 5 shows a flow diagram 500 according to an embodiment.
  • the flow diagram 500 illustrates a method for encoding an audio signal including a core audio signal portion and a residual audio signal portion.
  • a first candidate residual audio signal portion and a second candidate residual audio signal portion are selected from the residual audio signal portion.
  • the first candidate residual audio signal portion is compared with the second candidate residual audio signal portion.
  • the audio signal is encoded using the second candidate residual audio signal portion.
  • a second candidate residual audio signal portion e.g. a part of the residual audio signal portion such as a sub-set of the set of bits of the residual audio portion
  • a first candidate residual audio signal portion which may be a part of the residual audio signal portion that allows a higher quality than the second candidate residual audio signal portion. Based on the comparison, it may be decided whether the second candidate residual audio signal portion is used for encoding or not, for example whether the audio quality of an audio signal reconstructed for it is sufficient or whether additional data should be added to allow a higher quality of the reconstructed audio signal.
  • the core audio signal portion includes a plurality of core audio signal values and wherein the residual audio signal portion includes a plurality of residual audio signal values.
  • the first candidate residual audio signal portion is the residual audio signal portion.
  • the first candidate residual audio signal portion is different from the second candidate residual audio signal portion.
  • the audio signal is encoded using the first candidate residual audio signal portion or a pre-defined process is performed.
  • the pre-defined threshold is for example based on a human hearing perception threshold.
  • the pre-defined threshold is based on a human hearing mask.
  • encoding the audio signal includes generating a bit stream including the second candidate residual audio signal portion
  • the residual audio signal portion for example includes a plurality of residual audio signal values, and selecting the second candidate residual audio signal portion is performed based on a bit word representation of each residual audio signal value and based on an order of the residual audio signal values.
  • selecting the second candidate residual audio signal portion includes determining a minimum bit significance level; determining a border signal value position; including all bits of the bit word representations of the residual audio signal values that have, in their respective bit word, a higher bit significance level than the minimum bit significance level; and including all bits of the bit word representations of the residual audio signal values that have, in their respective bit word, the minimum bit significance level and that are part of a bit word representation of a signal value that has a position which fulfills a pre-defined criterion with regard to the border signal value position.
  • the pre-defined criterion is one of the position being below the border signal value position, the position being below or equal to the border signal value position, the position being above the border signal value position and the position being above or equal to the border signal value position.
  • the minimum bit significance level is determined based on a comparison of the first candidate residual audio signal portion with a fourth candidate residual audio signal portion or is a pre-defined minimum bit significance level.
  • the border signal value position is determined based on a comparison of the first candidate residual audio signal portion with a fifth candidate residual audio signal portion or is a pre-defined border signal value position.
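A minimal Python sketch of this bit-selection rule, assuming non-negative integer residual values, a fixed 4-bit word length, and the "position below the border" variant of the pre-defined criterion; all function and parameter names are illustrative.

```python
def select_candidate_bits(values, min_level, border_pos, word_len=4):
    """Select bits for a candidate residual portion: keep every bit with
    a significance level above min_level, plus the bits at min_level for
    residual values positioned before border_pos."""
    selected = []
    for pos, v in enumerate(values):
        for level in range(word_len - 1, -1, -1):   # MSB first
            bit = (v >> level) & 1
            if level > min_level or (level == min_level and pos < border_pos):
                selected.append((pos, level, bit))
    return selected

# Two 4-bit residual values; keep levels 3..2 everywhere, level 1 only
# for the value before border position 1:
print(select_candidate_bits([5, 3], min_level=1, border_pos=1))
```

Each tuple is (value position, bit significance level, bit); the lower the minimum level and the further the border position, the larger (and higher quality) the candidate portion.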
  • Each residual audio signal value may correspond to at least one frequency.
  • each residual audio signal value corresponds to at least one scale factor band.
  • FIG. 6 shows an encoder 600 according to an embodiment.
  • the encoder 600 serves for encoding an audio signal 601 including a core audio signal portion 602 and a residual audio signal portion 603.
  • the encoder 600 includes a selecting circuit 604 configured to select, from the residual audio signal portion a first candidate residual audio signal portion and a second candidate residual audio signal portion.
  • the encoder 600 includes a comparing circuit 605 configured to compare the first candidate residual audio signal portion with the second candidate residual audio signal portion.
  • the encoder 600 includes an encoding circuit 606 configured to, depending on the result of the comparison, encode the audio signal using the second candidate residual audio signal portion.
  • the encoder may include a memory which is for example used in the processing carried out by the encoder.
  • a memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
  • a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof.
  • a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor).
  • a “circuit” may also be a processor executing software, e.g. any kind of computer program, e.g. a computer program using a virtual machine code such as e.g. Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a “circuit” in accordance with an alternative embodiment.
  • FIG. 7 shows an encoder 700 according to an embodiment.
  • the encoder 700 receives an audio signal 701 as input, which is for example an original uncompressed audio signal that is to be encoded into an encoded bit stream 702.
  • the audio signal 701 is for example in integer PCM (Pulse Code Modulation) format and is losslessly transformed into the frequency domain by a domain transforming circuit 706 which for example carries out an integer modified discrete cosine transform (IntMDCT).
  • the resulting frequency coefficients (e.g. IntMDCT coefficients) are passed to a lossy encoding circuit 703 (e.g. an AAC encoder) which generates the core layer bit stream, e.g. an AAC bit stream, in other words a core audio signal portion.
  • the lossy encoding circuit 703 for example groups the frequency coefficients into scale factor bands (sfbs) and quantizes them, for example with a non-uniform quantizer.
  • an error-mapping procedure is employed by an error mapping circuit 704 which receives the frequency coefficients and the core layer bit stream as input to generate a residual spectrum (e.g. a lossless enhancement layer, LLE), in other words a residual audio signal portion, by subtracting the quantized frequency coefficients generated by the lossy encoder (e.g. the AAC quantized spectral data) from the original frequency coefficients.
  • the encoder 700 may thus be seen to include a core layer and a (lossless) enhancement layer.
  • the residual signal e[k] is for example computed by e[k] = c[k] − [thr(i[k])], where
  • c[k] is the IntMDCT coefficient,
  • i[k] is the quantized data vector produced by the quantizer (i.e. the lossy encoding circuit 703),
  • [·] is the flooring operation that rounds off a floating-point value to its nearest integer with smaller amplitude (i.e. towards zero), and
  • thr(i[k]) is the low boundary (towards-zero side) of the quantization interval corresponding to i[k].
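The error mapping can be sketched as follows; for simplicity, a uniform quantizer stands in for AAC's non-uniform one, so the `thr` helper and its `step` parameter are illustrative assumptions rather than the patent's actual quantizer.

```python
import math

def thr(i, step=4.0):
    """Low boundary (towards-zero side) of the quantization interval for
    index i, assuming a simple uniform quantizer with the given step size
    (a stand-in for AAC's non-uniform quantizer)."""
    return i * step

def error_map(c, i, step=4.0):
    """Residual e[k] = c[k] - [thr(i[k])], where [.] rounds a
    floating-point value towards zero to the nearest integer."""
    return c - math.trunc(thr(i, step))

# IntMDCT coefficient 11 quantized to index 2 (interval boundary 8.0):
print(error_map(11, 2))  # 3
```

The residual stays small when the quantizer approximates the coefficient well, which is what makes the residual spectrum cheap to encode in the enhancement layer.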
  • the residual spectrum is then encoded by a bit stream encoding circuit 705, for example according to the bit plane Golomb code (BPGC), context-based arithmetic code (CBAC) and low energy mode coding (LEMC), to generate a scalable enhancement layer bit stream (e.g. a scalable LLE layer bit stream).
  • the scalable enhancement layer bit stream is multiplexed by a multiplexer 707 with the core layer bit stream to produce the encoded bit stream 702.
  • the encoded bit stream 702 may be transmitted to a receiver which may decode it using a decoder corresponding to the encoder 700.
  • an example of a decoder is shown in figure 8.
  • FIG. 8 shows a decoder 800 according to an embodiment.
  • the decoder 800 receives an encoded bit stream 801 as input.
  • a bit stream parsing circuit 802 extracts the core layer bit stream 803 and the enhancement layer bit stream 804 from the encoded bit stream.
  • the enhancement layer bit stream 804 is decoded by a bit stream decoding circuit 805 corresponding to the bit stream encoding circuit 705 to reconstruct the residual spectrum as exactly as possible from the transmitted encoded bit stream 801.
  • the core layer bit stream 803 is decoded by a lossy decoding circuit 806 (e.g. an AAC decoder) and is combined with the reconstructed residual spectrum by an inverse error mapping circuit to generate the reconstructed frequency coefficients.
  • the reconstructed frequency coefficients are transformed into the time domain by a domain transforming circuit 808 corresponding to the domain transforming circuit 706 (e.g. an integer inverse MDCT) to generate a reconstructed audio signal 809.
  • the reconstructed audio signal 809 is scalable from lossy to lossless.
  • the bit stream encoding circuit 705 carries out a bit plane scanning scheme for encoding the residual spectrum.
  • SLS, using this bit plane scanning scheme, allows the scaling up of a perceptually coded representation such as MPEG-4 AAC to a lossless representation with a wide range of intermediate bit rate representations.
  • the bit plane scanning scheme in SLS is illustrated in figure 9.
  • FIG. 9 shows a bit plane diagram 900.
  • the residual spectrum values are represented as bit words (i.e. words of bits), wherein each bit word is written as a column and the bits of each bit word are ordered according to their significance, from the most significant bit to the least significant bit.
  • Each residual spectrum value for example corresponds to a frequency and belongs to a scale factor band.
  • the scale factor band (sfb) number increases from left to right (from 0 to s-1) along a second axis 902 (x-axis) .
  • the scanning process carried out by the bit stream encoding circuit 705 starts from the most significant bit of the spectral data (i.e. of the residual spectrum values) for all scale factor bands. It then progresses to the following bit planes until it reaches the least significant bit (LSB) for all scale factor bands. Starting from the fifth bit plane (or, in this example, the seventh bit plane for CBAC), the bit plane scanning process enters lazy-mode coding for the lazy bit planes, where the probability of a bit being 0 or 1 is assumed to be equal.
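The scanning order just described can be sketched as follows; `max_levels` is an illustrative stand-in for the per-sfb maximum bit plane levels, and lazy-mode and LEMC handling are omitted.

```python
def bit_plane_scan_order(max_levels):
    """Yield (bit_plane, sfb) pairs in SLS-style scanning order: start at
    the most significant bit plane across all scale factor bands, then
    move down one plane at a time until the least significant bit.
    max_levels[s] is the number of bit planes in scale factor band s."""
    top = max(max_levels)
    for plane in range(top - 1, -1, -1):          # MSB plane first
        for sfb, levels in enumerate(max_levels):
            if plane < levels:                    # this sfb has a bit here
                yield plane, sfb

order = list(bit_plane_scan_order([3, 2]))
print(order)  # [(2, 0), (1, 0), (1, 1), (0, 0), (0, 1)]
```

Truncating this sequence at any point yields a valid intermediate-rate representation, which is what gives SLS its fine granular scalability.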
  • As the frequency assignment rule of BPGC is derived from the Laplacian probability density function, BPGC only delivers excellent compression performance when the sources are near-Laplacian distributed. However, for some music items, there exist some "silence" time/frequency regions where the spectral data are in fact dominated by the rounding errors of IntMDCT.
  • low energy mode coding may be adopted for coding signals from low energy regions.
  • a scale factor band is defined as low energy if L[s] ≤ 0, where L[s] is the lazy bit plane as defined in [1] and [2].
  • a piece of music may be provided using different file versions depending on the level of compression, as it is illustrated in figure 10.
  • FIG. 10 shows a hierarchy of audio files 1001, 1002, 1003, 1004.
  • a first audio file is an uncompressed (.wav) file 1001.
  • a second audio file 1002 is an SLS high-quality lossy file at a total bit rate of 256kbps, which contains an AAC core format at 64kbps and an LLE enhancement at 192kbps.
  • This high-quality lossy file can be truncated to a third audio file 1003 which is a low-quality lossy file with AAC format at 64kbps, which can be used for mobile devices that normally have limited storage size.
  • a further "top-up" or "upsize" track (around 500kbps) for upgrading to the lossless version in a fourth audio file 1004 can be added onto the high-quality lossy format in the second audio file 1002 and can be sold separately.
  • the lossless format of the fourth audio file 1004 may also be available as a whole and may be sold at a higher price compared to the SLS high-quality lossy format of the second audio file 1002.
  • FIG. 11 shows a data flow diagram 1100.
  • the data flow takes place between a server 1101 and a client 1102.
  • the client 1102 for example has a mobile device 1103 and a hi-fi audio device 1104.
  • the music in raw wave format (CD format) 1105 together with the AAC encoded format is encoded using an SLS encoder 1106 to produce the lossless compressed format consisting of the AAC core layer (64kbps) 1108 and the lossless enhancement layer (LLE) 1107.
  • the archived format is then truncated by a truncator 1112 into three tracks 1108, 1109, 1110 which consist of the AAC track 1108 (AAC core layer) and two enhancement layer tracks 1109, 1110 with the first layer 1109 at 192kbps. All songs are stored in the server 1101 in this truncated lossless format 1115 only. If the client wants to buy the losslessly compressed audio, the full version (lossless format) including all three tracks 1108, 1109, 1110 will be available for them to download.
  • the 256kbps format 1113 is extracted from lossless format using an extractor 1111.
  • top-up track 1114 is extracted from the lossless format using the extractor 1111 and sent to the client 1102.
  • On the client side, if the client 1102 has bought the lossy version 1113 or the lossless format, these versions can be directly decoded by a player. If the client has bought the lossy version 1113 and wants to upgrade to the lossless version, he/she just needs to download the top-up track 1114. This top-up track 1114 can be patched together with the lossy version 1113 to achieve the lossless version.
  • the AAC core 1108 can be extracted from the downloaded music using an extractor function of the music player.
  • the high-quality lossy enhancing bit rate is fixed at 192 kbps.
  • the quality at this bit rate actually varies with different input audio sequences. For some audio sequences, this bit rate is good enough to achieve a transparent quality.
  • 192kbps may not be enough for some audio inputs with high dynamic range or intensive energy.
  • a method for smart enhancing is provided. This is illustrated in figure 12.
  • FIG. 12 shows a hierarchy of audio files 1201, 1202, 1203.
  • smart enhancing provides the function that, with a low-quality audio input file (e.g. an AAC 64kbps input) 1201 and its original (uncompressed) format, it enables a scalable encoder to automatically encode the minimum amount of enhancing bits necessary to generate a transparent quality audio file 1202 for this particular input.
  • This transparent quality lossy format can also be further “topped-up” (upgraded) to a lossless format audio file 1203.
  • a transparent bit rate estimation function is provided as it is illustrated in figure 13.
  • FIG. 13 shows a hierarchy of audio files 1301, 1302, 1303.
  • an estimation function provided according to one embodiment estimates the transparent bit rate for this audio signal.
  • This bit rate can be used to perform a smart truncation to a transparent quality audio file 1302 or a low-quality audio file 1303.
  • the estimated transparent quality format may, however, not be as accurate as the one obtained with smart enhancing as illustrated in figure 12.
  • three versions of audio with three qualities which include low quality lossy, transparent quality lossy and lossless quality may be provided.
  • the CD quality and the transparent quality will be indistinguishable for most listeners.
  • a process carried out by the encoder 700 according to one embodiment, for example for providing smart enhancing, is explained in the following with reference to figure 14.
  • FIG. 14 shows a flow diagram 1400 according to an embodiment.
  • the process is started in 1401 with the first frame of the input audio signal 701.
  • the input audio signal in the time domain is transformed into the spectral domain by the domain transforming circuit 706 (e.g. by an IntMDCT transform (with or without M/S (Main/Side) coding)) and encoded in 1402 by the lossy encoding circuit 703, e.g. according to the MPEG-4 AAC encoding method.
  • a perceptual hearing mask for this frame, given by a set of energy level values M[s], is generated in the coding process, with 0 ≤ s < S, where s is the scale factor band (sfb) number and S is the total number of scale factor bands.
  • a lossy encoded audio signal (e.g. a coded AAC signal) is also input to the encoder 700.
  • a psychoacoustic model (standard model I, standard model II or any open source model) is used in the process and the mask is generated from this model.
  • the error mapping process carried out by the error mapping circuit 704 calculates the difference between the lossy coded frequency components (e.g. the AAC coded spectrum) and the original spectrum provided by the domain transforming circuit 706.
  • the residual signal values provided by the error mapping circuit are structured into residual bit planes by the bit stream encoding circuit 705.
  • b_M[s] is the maximum bit plane level, i.e. the position of the most significant bit in the bit words corresponding to scale factor band s.
  • the residual bit planes are processed according to a bit plane coding process, which may be carried out according to BPGC, CBAC or LEMC (Low Energy Mode Code) .
  • the coding sequence that is used may be seen to correspond, in principle, to bit plane scanning according to SLS as illustrated in figure 9.
  • the bit plane coding starts from b_M[s]. In this example, b_M[s] = 5.
  • the bit plane to be coded is indicated by bp.
  • the second bp includes the bit planes with scanning order "2", and so on.
  • B[1] is the last scale factor band in the first bit plane to be coded in 1406, e.g. by BPGC/CBAC coding (non-LEMC).
  • L[s] ≤ 0 ∀ B[1] ≤ s < S
  • L[s] is the lazy bit plane as defined in [4] and [5].
  • the distortion check includes a direct bit plane reconstruction, filling element and comparison process.
  • ⁇ [k] is the reconstructed sign symbol (0 or 1)
  • b[k] [bp] is the bit symbol (0 or 1)
  • bM[s] is the total number of bit plane levels for the current sfb.
  • the reconstruction can be further enhanced by an estimation process.
  • the add-on amplitude for the following bit planes i.e. the bit planes below bit plane T
  • Q j ju is the frequency assignment for BPGC coding and is defined as
  • k is a coefficient in scale factor band s.
  • O[s] is the starting frequency element number of scale factor band s.
  • the distortion d[s] is compared with its respective mask M[s] in 1410.
  • the encoding for this frame can be stopped and the process continues with the next frame (if any), i.e. the process continues with testing whether there are any more frames in 1411. Otherwise, the last scale factor band which has noise exceeding the corresponding mask is recorded (e.g. sfb 37 in figure 4A).
  • the coding process continues in this manner until the condition that all the scale factor bands from 0 to B[bp] for bit plane bp have lower distortion than the mask is fulfilled.
  • FIG. 15 shows a bit plane diagram 1500.
  • a first bit plane 1503, a second bit plane 1504 and a third bit plane 1505 are shown in figure 15.
  • the encoding direction within a bit plane 1503, 1504, 1505 is the direction of a second axis 1502 (x-axis), which in this example is also the direction of increasing scale factor band numbers.
  • this means that it is tested whether the residual values which would arise for the scale factor bands from the bits encoded (scanned) so far differ from the true residual values at most by values as given by the mask, i.e., whether the difference lies beneath a pre-defined masking threshold.
  • the distortion is checked again according to 1409 and the encoding process either stops (i.e. no further bits are scanned, the generation of the bit stream is for example finished and the bit stream may now, for example, be entropy encoded) or, if the mask is exceeded for any scale factor band, a third check point 1508 is set according to B[3] and the remaining bits of the second bit plane 1504 are encoded according to 1413 until B[1] is reached.
  • the bits of the third bit plane 1505 are encoded until the third check point 1508 is reached and, if the mask is exceeded for any scale factor band, a value B[4] is set and the remaining bits of the third bit plane 1505 are encoded until B[1] is reached. The process continues in this manner until the mask is not exceeded for any scale factor band.
  • a standard decoder e.g. a standard SLS decoder
  • the encoding is carried out all the way to lossless (i.e. all bit planes are scanned) even if the checked distortion is below the mask for all the scale factor bands.
  • the bit rate at which this condition is satisfied may be recorded and coded in metadata of the generated bit stream. This bit rate indicates the transparent bit rate for enhancement (e.g. SLS enhancement).
  • a method for generating encoded audio data including an indication into the audio data specifying at least one part of the encoded audio data which does not have to be decoded for the decoded audio data to meet a pre-determined quality level.
  • for a pre-determined quality level, it may for example be indicated in an audio file which parts of the audio file can be skipped in decoding to achieve that quality level, e.g. to achieve transparent quality.
  • the indication may be a marking in the file indicating from which point on (or until which point) bits included in the file may be skipped in the decoding without falling below the predetermined quality level.
  • the indication specifies a bit-rate and the at least one part of the encoded audio data does not have to be decoded for the decoded audio to have the specified bit-rate. For example, at the specified bit-rate, the decoded audio data meets the pre-determined quality level.
  • the pre-determined quality level is based on a human hearing characteristic.
  • the pre-determined quality level may for example be based on a pre-defined human hearing threshold.
  • the pre-determined quality level is defined such that the noise introduced to the decoded audio data by leaving out the specified at least one part is below a pre-defined human hearing threshold.
  • the encoded audio data includes a bit- stream, the bit-stream including a plurality of bits, and the indication specifies at least one set of bits of the bit- stream which does not have to be decoded for the decoded audio data to meet a pre-determined quality level.
  • the set of bits is for example a continuous sub-bit-stream of the bit-stream.
  • the bit-stream may include a first part and a second part, the indication may specify which bits of the bit-stream belong to the first part and which bits of the bit-stream belong to the second part, and the decoded audio data meets the predetermined quality level if the first part is decoded and the second part is not decoded.
  • the bit rate needed for smart enhancing roughly depends on two parameters which can be extracted from the original signal and the lossy coded signal.
  • a bit rate estimation model is established, which allows estimation of the enhancing bit rate necessary for a low quality perceptual audio input to achieve transparent quality without actual encoding.
  • This bit rate can for example be used for SLS lossless format to perform a smart truncation as illustrated in figure 13.
  • its depth, i.e. the number of bit planes that are at least partially scanned
  • the length of the matrix can be estimated by the percentage of low energy scale factor bands (position of B[1]).
  • the noise-to-mask difference, D[s], for each scale factor band s in terms of dB can be simply computed as D[s] = 10·log10(E[s]/M[s]), where E[s] is the residual energy of band s.
  • a second parameter, which is the length of the enhancing matrix, is the percentage of non-low-energy scale factor bands in each frame. Training on a large set of data shows that a scale factor band is treated as "low energy" if the respective residual energy is lower than 20 dB. Therefore, if NL denotes the total number of scale factor bands with residual energy lower than 20 dB, the second parameter can be computed as the fraction of non-low-energy bands, (S − NL)/S.
  • a method for determining a bit-rate of an encoded audio signal generated from an original audio signal including determining, for a representation of the original audio signal as a core audio signal portion and a residual audio signal portion, a measure of the residual audio signal portion; and determining the bit-rate based on a comparison of the measure of the residual audio signal portion with a pre-defined residual threshold, wherein the pre-defined residual threshold is based on a predefined quality level.
  • the bit-rate of audio data that is necessary to, for example, achieve a certain quality level such as a transparent quality level is estimated, for example based on the way such audio data may be generated, such as it is described above with reference to figure 15.
  • the bit-rate may be estimated based on the number of bit- planes that are presumably at least partially encoded and the number of bits (e.g. per frame) that is encoded from each bit-plane.
  • the measure of the residual audio signal portion is an energy measure of the residual audio signal portion.
  • the encoded audio signal is generated such that it fulfills the pre-defined quality level.
  • the pre-determined quality level is based on a human hearing characteristic.
  • the pre-determined quality level may for example be based on a pre-defined human hearing threshold.
  • the bit-rate is further determined based on the number of residual audio signal values of the residual audio signal portion that fulfill a pre-determined criterion.
  • the pre-determined criterion is based on a pre- defined residual value energy threshold.
  • the bit-rate is further determined based on the number of residual audio signal values of the residual audio signal portion whose energy is beneath the residual value energy threshold.
  • the average transparent bit rate for all the 15 items is 205.2kbps.
  • the average bit rate saving compared to standard 256kbps is 19.8%, with a maximum bit rate saving of 49.6%.
  • the bitrates required for transparent quality are relatively low. This may be caused by:
    o the dynamic range of the signal energy is low
    o the signal has a low energy level in the high frequency domain
    o the excerpt contains certain periods of silence
  • a computer readable medium may be provided which has computer instructions recorded thereon, which, when executed by a computer, make the computer perform the respective method.
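The two-parameter bit rate estimation model described above (depth of the enhancing matrix from the noise-to-mask difference, length from the share of non-low-energy scale factor bands) can be sketched as follows. This is a minimal illustration: the trained mapping from these parameters to an actual bit rate is not reproduced here, and the 10·log10 form of D[s] and the function name are assumptions.

```python
import math

def estimate_parameters(E, M, low_energy_db=20.0):
    """Estimate the two enhancing-bit-rate model parameters for one frame.
    E[s]: residual energy per scale factor band (linear scale),
    M[s]: masking energy per band (linear scale).
    Returns (average noise-to-mask difference in dB over non-low-energy
    bands, fraction of non-low-energy bands)."""
    S = len(E)
    # Assumed dB form of the noise-to-mask difference: D[s] = 10*log10(E[s]/M[s])
    D = [10.0 * math.log10(E[s] / M[s]) for s in range(S)]
    energy_db = [10.0 * math.log10(E[s]) for s in range(S)]
    # Bands with residual energy below ~20 dB are treated as "low energy"
    non_low = [s for s in range(S) if energy_db[s] >= low_energy_db]
    p_length = len(non_low) / S                      # second parameter
    depth = sum(D[s] for s in non_low) / max(len(non_low), 1)
    return depth, p_length

depth, p_length = estimate_parameters([100.0, 10000.0], [10.0, 100.0])
```

The real model would then map (depth, p_length) to an enhancing bit rate via coefficients obtained from training, which the text does not specify.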


Abstract

A method for encoding an audio signal including a core audio signal portion and a residual audio signal portion is described, which includes selecting, from the residual audio signal portion, a first candidate residual audio signal portion and a second candidate residual audio signal portion; comparing the first candidate residual audio signal portion with the second candidate residual audio signal portion; and, depending on the result of the comparison, encoding the audio signal using the second candidate residual audio signal portion.

Description

Method and device for encoding an audio signal, method and device for generating encoded audio data and method and device for determining a bit-rate of an encoded audio signal
Embodiments of the invention generally relate to a method and device for encoding an audio signal, a method and a device for generating encoded audio data and a method and a device for determining a bit-rate of an encoded audio signal.
Online music stores such as Apple's iTunes have shown the viability of online music sales. For example, in case of iTunes, the music files are encoded in MPEG-4 advanced audio coding (AAC) format at 128kbps.
An existing limitation of online music sales is that the music files are only offered at fixed bitrates in lossy compressed form. However, with the proliferation of broadband access and the continuous decline of memory storage prices, there is an increasing number of music lovers who wish to purchase their favorite music online at a high resolution which is as good as CD quality. On the other hand, some users may prefer to purchase songs that are cheaper but are encoded at a lower bit rate. This is because the perceptual difference between, for example, an audio file encoded at 96kbps and an audio file encoded at 128kbps is transparent or unimportant to such users, for example in case the music is played on a mobile device.
In order to satisfy the various bit rate and quality requirements of their customers, music stores may archive different versions of the same piece of music at different bitrates on their file servers. This is a burden for the file servers since it increases the complexity of the data management and the amount of necessary storage space.
Alternatively, a music store may prefer to encode songs at a required bit rate only when a purchase order is received for the required bit rate. This, however, is both time consuming and computationally intensive. Moreover, there may be customers who may wish to upgrade a piece of music, e.g. a song that they have purchased to a better quality, for example in case that they want to listen to the song using a hi-fi system. In this case, the only option would be to purchase and download the entire song with a larger size, resulting in them having to keep different versions of the same song. It is therefore not pragmatic to employ fixed bit rate audio coding on an online music store offering multiple qualities for songs for both the online music store and its customers.
Fine granular scalable audio coding, which has been extensively studied recently, provides a solution to variable quality requirements. Released by the ISO in late 2006, MPEG-4 audio scalable lossless coding (SLS, cf. e.g. [1] and [2]) integrates the functions of lossless audio coding, perceptual audio coding and fine granular scalable audio coding in a single framework. It allows the scaling up of a perceptually coded representation such as a MPEG-4 AAC coded piece of audio to a lossless representation of the piece of audio with fine granular scalability.
Furthermore, a music manager system based on SLS coding technology has been designed for online music stores. With such a music manager system, a server maintained by an online store is able to deliver songs to its clients at various bitrates and prices with single file archival for each piece of music. The processing of the files may be performed very fast and the upgrading of the quality of a piece of music by a customer can be easily and efficiently achieved by offering a "top-up" to the original song without the need of keeping multiple copies for the same piece of music.
In this way, multi-bit rate music files are currently provided by online music stores. However, this does not mean that multi-quality music sales are already available. This is because the quality of music at a fixed bit rate may vary for different pieces of music.
The multi-quality model is more desirable, but there is a lack of a clear link between scalable audio bit rate and perceptual quality.
Embodiments may be seen to be based on the problem to provide a method for providing an encoded audio signal which allows providing audio signals that meet a pre-defined quality requirement.
This problem is solved by the methods and devices according to the independent claims.
In one embodiment, a method for encoding an audio signal including a core audio signal portion and a residual audio signal portion is provided. The method includes selecting, from the residual audio signal portion, a first candidate residual audio signal portion and a second candidate residual audio signal portion; comparing the first candidate residual audio signal portion with the second candidate residual audio signal portion; and, depending on the result of the comparison, encoding the audio signal using the second candidate residual audio signal portion.
In one embodiment, a method for generating encoded audio data is provided that includes including an indication into the audio data specifying at least one part of the encoded audio data which does not have to be decoded for the decoded audio data to meet a pre-determined quality level.
According to one embodiment, a method for determining a bit- rate of an encoded audio signal generated from an original audio signal is provided that includes determining, for a representation of the original audio signal as a core audio signal portion and a residual audio signal portion, a measure of the residual audio signal portion; and determining the bit-rate based on a comparison of the measure of the residual audio signal portion with a pre-defined residual threshold, wherein the pre-defined residual threshold is based on a predefined quality level.
According to other embodiments, devices according to the methods described above are provided. Embodiments which are described in context with the methods described above are analogously valid for the respective device and vice versa.
Illustrative embodiments of the invention are explained below with reference to the drawings.
FIG. 1 shows a frequency - sound pressure level diagram.
FIG. 2 shows a time - threshold increase diagram.
FIG. 3 shows a frequency - audio energy diagram.
FIGs. 4A and 4B show frequency - audio energy diagrams.
FIG. 5 shows a flow diagram according to an embodiment.
FIG. 6 shows an encoder according to an embodiment.
FIG. 7 shows an encoder according to an embodiment.
FIG. 8 shows a decoder according to an embodiment.
FIG. 9 shows a bit plane diagram.
FIG. 10 shows a hierarchy of audio files.
FIG. 11 shows a data flow diagram.
FIG. 12 shows a hierarchy of audio files.
FIG. 13 shows a hierarchy of audio files.
FIG. 14 shows a flow diagram according to an embodiment.
FIG. 15 shows a bit plane diagram.
In some situations, an otherwise clearly audible sound can be masked by another sound. For example, conversation at a bus stop can be completely impossible while an incoming bus producing loud noise drives past. This phenomenon is called masking. A weaker sound is masked if it is made inaudible in the presence of a louder sound. If two sounds occur simultaneously and one is masked by the other, this is referred to as simultaneous masking. Simultaneous masking is also sometimes called frequency masking. This is illustrated in Figure 1.
FIG. 1 shows a frequency-sound pressure level diagram 100.
Frequency values correspond to the values along a first axis (x-axis) 101 and sound pressure levels (in dB) correspond to values along a second axis (y-axis) 102.
A first line 103 illustrates a high intensity signal. The high intensity signal behaves like a masker and is able to mask a relatively weak signal (illustrated by a second line 104) in a nearby frequency range. The masking level is illustrated by a dashed line 105 while the audible level without masking is illustrated by a solid line 106.
Similarly to frequency masking, a weak sound emitted soon after the end of a louder sound may be masked by the louder sound. Even a weak sound just before a louder sound can be masked by the louder sound. These two effects are called pre- and post temporal masking, respectively. They are illustrated in figure 2.
FIG. 2 shows a time - threshold increase diagram 200.
Time values correspond to the values along a first axis (x-axis) 201 and the threshold increase (in dB) corresponds to values along a second axis (y-axis) 202. A solid line 203 illustrates the audibility threshold increase that is caused by a masking signal illustrated by a block 204.
Masking may be applied in audio compression to determine which frequency components can be discarded or more compressed (e.g. by rougher quantization) .
For example, perceptual audio coding is a method of encoding audio that uses psychoacoustic models to discard data corresponding to audio components which may not be perceived by humans .
Typically, frequencies from an audio signal are eliminated that the human ear cannot hear. Perceptual audio coding may also eliminate softer sounds that are being drowned out by louder sounds, i.e. advantage of masking may be taken. For example, an audio signal is first decomposed into several critical bands using filter banks. Average amplitudes are calculated for each frequency band and are used to obtain corresponding hearing thresholds.
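The decomposition step above can be illustrated with a minimal sketch. The band edges are arbitrary illustrative values; a real perceptual coder uses critical-band filter banks and a full psychoacoustic model to derive the thresholds, which is omitted here.

```python
def band_averages(spectrum, band_edges):
    """Average amplitude per band; in a perceptual coder these
    averages feed the psychoacoustic model that yields per-band
    hearing thresholds (the threshold derivation is omitted)."""
    averages = []
    # band_edges[i]..band_edges[i+1] delimits band i (half-open)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = spectrum[lo:hi]
        averages.append(sum(abs(x) for x in band) / len(band))
    return averages

# Two toy bands covering a 5-line spectrum:
band_averages([1, -1, 2, 2, 4], [0, 2, 5])
```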
A typical plot of audio signal energy and its mask calculated from a psychoacoustic model is shown in Figure 3.
FIG. 3 shows a frequency - audio energy diagram 300.
Frequencies (in terms of scale factor bands) correspond to the values along a first axis (x-axis) 301 and the audio energy levels (in dB) correspond to values along a second axis (y-axis) 302. A first line 303 indicates the audio energy per scale factor band for an exemplary audio file for one frame of the audio file (wherein a frame includes a plurality of sample values of an audio signal) . A second line 304 indicates the masking level that is caused by the sounds of the audio file.
Signals with energies below the modified threshold (i.e. the masking level) may be considered inaudible and may be discarded in perceptual coding. In this way, the total entropy of the audio signal can be reduced, opening possibilities to obtain higher compression rates.
A reason to use perceptual audio coding is to reduce the file size of an audio file, since high quality audio with no compression can lead to very big file sizes. Perceptual audio coding is known as lossy audio compression because it is "losing" pieces of sounds (the removed ones that cannot be heard). For example, MP3 (MPEG-1 Layer 3) and AAC (MPEG-4 advanced audio coding) are perceptual coding methods. Another type of compression is lossless compression, which encodes repetitive information to symbols to reduce audio file size. This allows a user to reconstruct an exact copy of the original audio signal. However, with lossless compression, the rate of compression that can be achieved is not as high as the rate of compression achievable with lossy compression. For example, FLAC (free lossless audio coding), Monkey's Audio and ALS (MPEG-4 audio lossless coding) are lossless coding methods.
The noise (distortion) of an audio signal may be defined as the difference between the original audio signal and the compressed audio signal. It is shown by various types of subject tests that if the energy of the noise is controlled to be below the audio masking level (provided that the mask calculation is accurate enough) the noise is not perceptible by ("typical") human ears. Thus, the quality of the compressed audio signal will be "transparent" compared to the quality of the original audio signal.
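Under this definition, a per-frame transparency check reduces to comparing the noise energy against the mask in every scale factor band. The following is a simplified sketch; the noise and mask values would come from the coder's error mapping and psychoacoustic analysis.

```python
def is_transparent(noise_energy, mask):
    """True when the coding noise stays at or below the masking
    level in every scale factor band, i.e. the compressed signal
    should be perceptually indistinguishable from the original."""
    return all(n <= m for n, m in zip(noise_energy, mask))

is_transparent([1.0, 0.5], [2.0, 0.6])   # noise under the mask everywhere
is_transparent([1.0, 0.7], [2.0, 0.6])   # second band exceeds the mask
```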
For example, for the MPEG-4 AAC standard codec, transparent quality can be achieved around 128-192kbps (stereo). Example plots of the noise generated under MPEG-4 AAC at 64kbps and 128kbps are shown in Figures 4A and 4B.
FIGs. 4A and 4B show frequency - audio energy diagrams 401 and 402.
Frequencies (in terms of scale factor bands) correspond to the values along a respective first axis (x-axis) 401 and the audio energy levels (in dB) correspond to values along a respective second axis (y-axis) 402.
A respective first line 403 indicates the audio energy per scale factor band for an exemplary audio file for one frame of the audio file. A respective second line 404 indicates the masking level that is caused by the sounds of the audio file and a respective third line 405 indicates the noise level in case of MPEG-4 AAC compression at 64kbps (FIG. 4A) and at 128kbps (FIG. 4B), respectively.
It can be shown that for the near-transparent bit rate, most of the noise is below the audio mask.
In one embodiment, a method is provided that allows a scalable audio coder to encode an audio signal using a minimum enhancing bit rate for the particular audio signal necessary to achieve transparent encoding quality of the audio signal. The input for the method may for example be a low quality perceptual audio signal and the original (i.e. uncompressed) audio signal. In one embodiment, by applying perceptual information from a psychoacoustic model (which may be a conventional psychoacoustic model) the encoding based on the method provided is able to stop at an optimal point (e.g. at an optimal position in the bit plane scanning process) for which transparent quality can just be achieved. This method can for example be used to satisfy multi-quality music requirements, such as they arise for an online music store.
FIG. 5 shows a flow diagram 500 according to an embodiment.
The flow diagram 500 illustrates a method for encoding an audio signal including a core audio signal portion and a residual audio signal portion.
In 501, a first candidate residual audio signal portion and a second candidate residual audio signal portion are selected from the residual audio signal portion.
In 502, the first candidate residual audio signal portion is compared with the second candidate residual audio signal portion.
In 503, depending on the result of the comparison, the audio signal is encoded using the second candidate residual audio signal portion.
In other words, in one embodiment, a second candidate residual audio signal portion, e.g. a part of the residual audio signal portion such as a sub-set of the set of bits of the residual audio portion, is compared with a first candidate residual audio signal portion which may be a part of the residual audio signal portion that allows a higher quality than the second candidate residual audio signal portion. Based on the comparison, it may be decided whether the second candidate residual audio signal portion is used for encoding or not, for example whether the audio quality of an audio signal reconstructed from it is sufficient or whether additional data should be added to allow a higher quality of the reconstructed audio signal.
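The decision described above can be sketched roughly as follows. The `reconstruct` function and the squared-error distortion are illustrative stand-ins for the embodiment's bit plane reconstruction and masking comparison, and the function name is an assumption.

```python
def choose_candidate(first, second, mask, reconstruct):
    """Decide whether the smaller second candidate suffices.
    reconstruct() maps a candidate portion to per-band residual
    values; the distortion between the two reconstructions is
    compared band by band against the mask."""
    ref = reconstruct(first)          # higher-quality candidate
    test = reconstruct(second)        # smaller candidate under test
    distortion = [(a - b) ** 2 for a, b in zip(ref, test)]
    if all(d <= m for d, m in zip(distortion, mask)):
        return "second"               # quality already sufficient
    return "first"                    # need the higher-quality portion

identity = lambda x: x                # trivial reconstruction for illustration
choose_candidate([4, 5], [4, 3], [1.0, 5.0], identity)
```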
In one embodiment, the core audio signal portion includes a plurality of core audio signal values and the residual audio signal portion includes a plurality of residual audio signal values.
In one embodiment, the first candidate residual audio signal portion is the residual audio signal portion.
In one embodiment, the first candidate residual audio signal portion is different from the second candidate residual audio signal portion.
In one embodiment, depending on the result of the comparison, the audio signal is encoded using the first candidate residual audio signal portion or a pre-defined process is performed.
The pre-defined process may include selecting a third candidate residual audio signal portion different from the second candidate residual audio signal portion and encoding the audio signal using the third candidate residual audio signal portion. Comparing the first candidate residual audio signal portion with the second candidate residual audio signal portion may include checking whether the difference between at least one first residual value reconstructed from the first candidate residual audio signal portion according to a pre-determined reconstruction scheme and at least one second residual value reconstructed from the second candidate residual audio signal portion according to the pre-determined reconstruction scheme is lower than a pre-defined threshold.
The pre-defined threshold is for example based on a human hearing perception threshold. For example, the pre-defined threshold is based on a human hearing mask.
In one embodiment, encoding the audio signal includes generating a bit stream including the second candidate residual audio signal portion.
The residual audio signal portion for example includes a plurality of residual audio signal values, and selecting the second candidate residual audio signal portion is performed based on a bit word representation of each residual audio signal value and based on an order of the residual audio signal values.
In one embodiment, selecting the second candidate residual audio signal portion includes determining a minimum bit significance level; determining a border signal value position; including all bits of the bit word representations of the residual audio signal values that have, in their respective bit word, a higher bit significance level than the minimum bit significance level; and including all bits of the bit word representations of the residual audio signal values that have, in their respective bit word, the minimum bit significance level and that are part of a bit word representation of a signal value that has a position which fulfills a pre-defined criterion with regard to the border signal value position.
For example, the pre-defined criterion is one of the position being below the border signal value position, the position being below or equal to the border signal value position, the position being above the border signal value position and the position being above or equal to the border signal value position.
In one embodiment, the minimum bit significance level is determined based on a comparison of the first candidate residual audio signal portion with a fourth candidate residual audio signal portion or is a pre-defined minimum bit significance level.
In one embodiment, the border signal value position is determined based on a comparison of the first candidate residual audio signal portion with a fifth candidate residual audio signal portion or is a pre-defined border signal value position.
Each residual audio signal value may correspond to at least one frequency. For example, each residual audio signal value corresponds to at least one scale factor band.
The method for encoding an audio signal illustrated in figure 5 is for example carried out by an encoder as illustrated in figure 6.
FIG. 6 shows an encoder 600 according to an embodiment.
The encoder 600 serves for encoding an audio signal 601 including a core audio signal portion 602 and a residual audio signal portion 603.
The encoder 600 includes a selecting circuit 604 configured to select, from the residual audio signal portion a first candidate residual audio signal portion and a second candidate residual audio signal portion.
Further, the encoder 600 includes a comparing circuit 605 configured to compare the first candidate residual audio signal portion with the second candidate residual audio signal portion.
Additionally, the encoder 600 includes an encoding circuit 606 configured to, depending on the result of the comparison, encode the audio signal using the second candidate residual audio signal portion.
The encoder may include a memory which is for example used in the processing carried out by the encoder. A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory), or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), an EEPROM (Electrically Erasable PROM), or a flash memory, e.g. a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
In an embodiment, a "circuit" may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a "circuit" may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A "circuit" may also be a processor executing software, e.g. any kind of computer program, e.g. a computer program using a virtual machine code such as e.g. Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a "circuit" in accordance with an alternative embodiment.
An example for an encoder 600, for example configured to perform the method illustrated in figure 5 is shown in figure 7.
FIG. 7 shows an encoder 700 according to an embodiment.
The encoder 700 receives an audio signal 701 as input, which is for example an original uncompressed audio signal which should be encoded to an encoded bit stream 702.
The audio signal 701 is for example in integer PCM (Pulse Code Modulation) format and is losslessly transformed into the frequency domain by a domain transforming circuit 706 which for example carries out an integer modified discrete cosine transform (IntMDCT). The resulting frequency coefficients (e.g. IntMDCT coefficients) are passed to a lossy encoding circuit 703 (e.g. an AAC encoder) which generates the core layer bit stream, e.g. an AAC bit stream, in other words a core audio signal portion. The lossy encoding circuit 703 for example groups the frequency coefficients into scale factor bands (sfbs) and quantizes them, for example with a nonuniform quantizer. In order to efficiently utilize the information of the spectral data that has been coded in the core layer bit stream, an error-mapping procedure is employed by an error mapping circuit 704 which receives the frequency coefficients and the core layer bit stream as input to generate a residual spectrum (e.g. a lossless enhancement layer, LLE), in other words a residual audio signal portion, by subtracting the quantized frequency coefficients generated by the lossy encoder (e.g. the AAC quantized spectral data) from the original frequency coefficients. The encoder 700 may thus be seen to include a core layer and a (lossless) enhancement layer.
As an example, for k = {0, 1, ..., N - 1}, where N is the dimension of the IntMDCT, the residual signal e[k] is for example computed by

e[k] = c[k],                 if i[k] = 0
e[k] = c[k] - ⌊thr(i[k])⌋,   if i[k] ≠ 0        (1)

where c[k] is the IntMDCT coefficient, i[k] is the quantized data vector produced by the quantizer (i.e. the lossy encoding circuit 703), ⌊•⌋ is the flooring operation that rounds off a floating-point value to its nearest integer with smaller amplitude and thr(i[k]) is the lower boundary (towards-zero side) of the quantization interval corresponding to i[k].
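As an illustration only, the error mapping of equation (1) can be sketched in Python as follows; here thr is passed in as a hypothetical stand-in for the quantizer's interval-boundary function, which in an actual encoder is determined by the core coder (e.g. the AAC quantizer):

```python
import math

def error_map(coeffs, indices, thr):
    """Error mapping per equation (1): keep c[k] where the core layer
    quantized to zero, otherwise subtract the floored lower boundary
    (towards-zero side) of the chosen quantization interval."""
    residual = []
    for c, i in zip(coeffs, indices):
        if i == 0:
            residual.append(c)
        else:
            # math.trunc rounds towards zero, i.e. to the nearest
            # integer with smaller amplitude
            residual.append(c - math.trunc(thr(i)))
    return residual
```

With a toy boundary function thr(i) = 1.5 * i, the coefficients [5, 7, -3] with quantizer outputs [0, 2, -1] map to the residuals [5, 4, -2].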
The residual spectrum is then encoded by a bit stream encoding circuit 705, for example according to the bit plane Golomb code (BPGC) , context-based arithmetic code (CBAC) and low energy mode coding (LEMC) to generate a scalable enhancement layer bit stream (e.g. a scalable LLE layer bit stream) .
Finally, the scalable enhancement layer bit stream is multiplexed by a multiplexer 707 with the core layer bit stream to produce the encoded bit stream 702.
The encoded bit stream 702 may be transmitted to a receiver which may decode it using a decoder corresponding to the encoder 700. An example for a decoder is shown in figure 8.
FIG. 8 shows a decoder 800 according to an embodiment.
The decoder 800 receives an encoded bit stream 801 as input. A bit stream parsing circuit 802 extracts the core layer bit stream 803 and the enhancement layer bit stream 804 from the encoded bit stream. The enhancement layer bit stream 804 is decoded by a bit stream decoding circuit 805 corresponding to the bit stream encoding circuit 705 to reconstruct the residual spectrum as exactly as possible from the transmitted encoded bit stream 801.
The core layer bit stream 803 is decoded by a lossy decoding circuit 806 (e.g. an AAC decoder) and is combined with the reconstructed residual spectrum by an inverse error mapping circuit to generate the reconstructed frequency coefficients. The reconstructed frequency coefficients are transformed into the time domain by a domain transforming circuit 808 corresponding to the domain transforming circuit 706 (e.g. an integer inverse MDCT) to generate a reconstructed audio signal 809.
By selecting how much of the residual spectrum generated by the error mapping circuit 704 is transmitted to the decoder 800, the reconstructed audio signal 809 is scalable from lossy to lossless.
In SLS (MPEG-4 scalable lossless) coding, the bit stream encoding circuit 705 carries out a bit plane scanning scheme for encoding the residual spectrum. SLS, using this bit plane scanning scheme, allows the scaling up of a perceptually coded representation such as MPEG-4 AAC to a lossless representation with a wide range of intermediate bit rate representations.
The bit plane scanning scheme in SLS is illustrated in figure 9.
FIG. 9 shows a bit plane diagram 900.
In the bit plane diagram 900, the residual spectrum values are represented as bit words (i.e. words of bits), wherein each bit word is written as a column and the bits of each bit word are ordered according to their significance, starting from the most significant bit.
The significance of the bits in their respective bit word increases along a first axis 901 (y-axis) . Each residual spectrum value for example corresponds to a frequency and belongs to a scale factor band. The scale factor band (sfb) number increases from left to right (from 0 to s-1) along a second axis 902 (x-axis) .
The scanning process carried out by the bit stream encoding circuit 705 starts from the most significant bit of the spectral data (i.e. of the residual spectrum values) for all scale factor bands. It then progresses to the following bit planes until it reaches the least significant bit (LSB) for all scale factor bands. Starting from the fifth bit plane or, in this example, the seventh bit plane (for CBAC), the bit plane scanning process enters lazy-mode coding for the lazy bit planes, where the probability of a bit being 0 or 1 is assumed to be equal.
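The scanning order described above can be sketched as follows; this is an illustrative sketch which assumes that bit significance levels are counted from 1 at the least significant bit and that each band's scan starts at its own most significant bit-plane level:

```python
def scan_order(bM):
    """Yield (sfb, level) pairs in an SLS-style scanning order as in
    figure 9: pass bp = 1 takes the most significant remaining bit of
    every band, pass bp = 2 the next one, and so on down to the LSB.
    bM[s] is the most significant bit-plane level of band s."""
    for depth in range(max(bM)):          # one pass per bit plane, bp = depth + 1
        for s, top in enumerate(bM):
            level = top - depth           # significance of the bit taken from band s
            if level >= 1:                # band s still has a bit at this depth
                yield s, level
```

For two bands with bM = [2, 3], the scan visits (band, level) in the order (0, 2), (1, 3), (0, 1), (1, 2), (1, 1): bands with fewer bit planes simply drop out of later passes.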
As the frequency assignment rule of BPGC is derived from the Laplacian probability density function, BPGC only delivers excellent compression performance when the sources are near-Laplacian distributed. However, for some music items, there exist some "silent" time/frequency regions where the spectral data are in fact dominated by the rounding errors of the IntMDCT. In order to improve the coding efficiency, low energy mode coding may be adopted for coding signals from low energy regions. A scale factor band is defined as low energy if L[s] < 0, where L[s] is the lazy bit plane as defined in [1] and [2].
It is possible to improve the coding efficiency of BPGC by further incorporating more sophisticated probability assignment rules that take into account the dependencies of the distribution of the IntMDCT spectral data on several contexts, such as their frequency locations or the amplitudes of adjacent spectral lines, which can be effectively captured by using CBAC. For CBAC coding, the seventh and lower bit planes are set as absolute lazy bit planes in the SLS reference codec.
In a commercial application of MPEG-4 SLS, for example an online music store and music manager, a piece of music may be provided using different file versions depending on the level of compression, as it is illustrated in figure 10.
FIG. 10 shows a hierarchy of audio files 1001, 1002, 1003, 1004.
A first audio file is an uncompressed (.wav) file 1001. A second audio file 1002 is an SLS high-quality lossy file at a total bit rate of 256 kbps, which contains an AAC core format at 64 kbps and an LLE enhancement at 192 kbps. This high-quality lossy file can be truncated to a third audio file 1003, which is a low-quality lossy file in AAC format at 64 kbps, which can be used for mobile devices that normally have limited storage size. A further "top-up" or "upsize" track (around 500 kbps) for upgrade to the lossless version in a fourth audio file 1004 can be added onto the high-quality lossy format in the second audio file 1002 and can be sold separately. The lossless format of the fourth audio file 1004 may also be available as a whole and may be sold at a higher price compared to the SLS high-quality lossy format of the second audio file 1002.
In the following, it is described with reference to figure 11 how an SLS music manager may work.
FIG. 11 shows a data flow diagram 1100. The data flow takes place between a server 1101 and a client 1102. The client 1102 for example has a mobile device 1103 and a hi-fi audio device 1104.
On the server side, the music in raw wave format (CD format) 1105, together with the AAC encoded format, is encoded using an SLS encoder 1106 to produce the losslessly compressed format consisting of the AAC core layer (64 kbps) 1108 and the lossless enhancement layer (LLE) 1107.
The archived format is then truncated by a truncator 1112 into three tracks 1108, 1109, 1110 which consist of the AAC track 1108 (AAC core layer) and two enhancement layer tracks 1109, 1110 with the first layer 1109 at 192kbps. All songs are stored in the server 1101 in this truncated lossless format 1115 only. If the client wants to buy the losslessly compressed audio, the full version (lossless format) including all three tracks 1108, 1109, 1110 will be available for them to download.
If the client 1102 wants a good-quality lossy version instead, the 256 kbps format 1113 is extracted from the lossless format using an extractor 1111.
If the client has purchased the lossy version and further decides to upgrade to the lossless version, a top-up track 1114 is extracted from the lossless format using the extractor 1111 and sent to the client 1102.
On the client side, if the client 1102 has bought the lossy version 1113 or the lossless format, these versions can be directly decoded by a player. If the client has bought the lossy version 1113 and wants to upgrade to the lossless version, he/she just needs to download the top-up track 1114. This top-up track 1114 can be patched together with the lossy version 1113 to achieve the lossless version.
If the client wishes to transfer the downloaded music to the mobile device 1103 which normally has limited storage size and playback quality, the AAC core 1108 can be extracted from the downloaded music using an extractor function of the music player.
In the music manager model of the SLS coder as described above, the high-quality lossy enhancing bit rate is fixed at 192 kbps. However, the quality at this bit rate actually varies with different input audio sequences. For some audio sequences, this bit rate is sufficient to achieve transparent quality. At the same time, 192 kbps may not be enough for some audio inputs with high dynamic range or intensive energy.
In one embodiment, a method for smart enhancing is provided. This is illustrated in figure 12.
FIG. 12 shows a hierarchy of audio files 1201, 1202, 1203.
In one embodiment, smart enhancing provides the function that, with a low-quality audio input file (e.g. an AAC 64kbps input) 1201 and its original (uncompressed) format, it enables a scalable encoder to automatically encode the minimum amount of enhancing bits necessary to generate a transparent quality audio file 1202 for this particular input. This transparent quality lossy format can also be further "topped-up" (upgraded) to a lossless format audio file 1203.
In addition, in one embodiment, a transparent bit rate estimation function is provided as it is illustrated in figure 13.
FIG. 13 shows a hierarchy of audio files 1301, 1302, 1303.
Given a lossless SLS audio format file 1301 including an encoded audio signal, an estimation function provided according to one embodiment estimates the transparent bit rate for this audio signal. This bit rate can be used to perform a smart truncation to a transparent quality audio file 1302 or a low-quality audio file 1303. The estimated transparent quality format may, however, not be as accurate as the one obtained with smart enhancing as illustrated in figure 12.
With the smart enhancing function provided according to one embodiment, three versions of audio with three qualities, which include low quality lossy, transparent quality lossy and lossless quality, may be provided. Using normal PC speakers and portable music players, the CD quality and the transparent quality will be indistinguishable for most listeners.
A process carried out by the encoder 700 according to one embodiment, for example for providing smart enhancing, is explained in the following with reference to figure 14.
FIG. 14 shows a flow diagram 1400 according to an embodiment. The process is started in 1401 with the first frame of the input audio signal 701. The input audio signal in the time domain is transformed into the spectral domain by the domain transforming circuit 706 (e.g. by an IntMDCT transform (with or without M/S (Main/Side) coding)) and encoded in 1402 by the lossy encoding circuit 703, e.g. according to the MPEG-4 AAC encoding method. A perceptual hearing mask for this frame, given by a set of energy level values M[s], is generated in the coding process, with 0 ≤ s < S, where s is the scale factor band (sfb) number and S is the total number of scale factor bands.
In another embodiment, besides the original audio signal 701, a lossy encoded audio signal (e.g. a coded AAC signal) is also input to the encoder 700. In this case, a psychoacoustic model (standard model I, standard model II or any open source model) is used in the process and the mask is generated from this model.
In 1403, as described above with reference to figure 7, the error mapping process carried out by the error mapping circuit 704 calculates the difference between the lossy coded frequency components (e.g. the AAC coded spectrum) and the original spectrum provided by the domain transforming circuit 706.
In this embodiment, the residual signal values provided by the error mapping circuit are structured into residual bit planes by the bit stream encoding circuit 705. For each scale factor band s, let the maximum bit plane level (i.e. the position of the most significant bit in the bit words corresponding to the scale factor band s) be given by bM[s]. The residual bit planes are processed according to a bit plane coding process, which may be carried out according to BPGC, CBAC or LEMC (Low Energy Mode Code).
The coding sequence that is used may be seen to correspond, in principle, to bit plane scanning according to SLS as illustrated in figure 9. For each scale factor band, the bit plane coding starts from bM[s]. For example, for s = 48, bM[s] = 5. For the whole frame, the bit plane to be coded (in other words, scanned) is indicated by bp. As shown in figure 9, the first bit plane (bp = 1) includes the bit planes with scanning order "1". The second bit plane (bp = 2) includes the bit planes with scanning order "2", and so on.
In 1404, the current (i.e. the currently processed) bit plane to be coded is set to bp = 1.
For bp = 1, the bit plane coding starts from sfb = 0 in 1405, which is increased by 1 in 1407 after bit plane coding the current (currently processed) scale factor band sfb of the current bit plane in 1406. In 1408, it is checked whether sfb < B[bp].
If this is the case, the process returns to 1406 (with the increased sfb) such that the scale factor band index runs (in the loop formed by 1406 to 1408) from sfb = 0 to sfb = B[bp].
For example, B[1] is the last scale factor band in the first bit plane to be coded in 1406, e.g. by BPGC/CBAC coding (non-LEMC).
B[1] is for example defined as the minimal value for which

L[s] < 0 ∀ B[1] < s < S

holds, where L[s] is the lazy bit plane as defined in [1] and [2].
If sfb < B[bp] does not hold, the process continues with 1409. In 1409, a distortion check is performed. For example, for the first bit plane (bp = 1), once the first bit plane for the scale factor bands from 0 to B[1] is coded (which may also be seen as a collection of individual first bit planes for each scale factor band), the distortion check is performed.
The distortion check includes a direct bit plane reconstruction, filling element and comparison process.
The reconstructed value ê[k][T] for residual frequency element k (0 ≤ k < N, where N is the number of frequency coefficients per frame) in scale factor band s may be computed by using the encoded bit planes from bp = 1 to bp = T (the current bit plane up to which the bit plane coding has been carried out at the current stage):

ê[k][T] = (-1)^ŝ[k] · Σ_{bp=1}^{T} b[k][bp] · 2^(bM[s] - bp)

where ŝ[k] is the reconstructed sign symbol (0 or 1), b[k][bp] is the bit symbol (0 or 1) and bM[s] is the total number of bit plane levels for the current sfb.
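Writing ê[k][T] for the partial reconstruction, this step can be sketched as follows; the bit weight 2^(bM[s] - bp) per coded plane and the sign convention (sign symbol 1 meaning negative) are assumptions consistent with the bit plane ordering described above:

```python
def reconstruct_partial(sign_sym, bits, bM, T):
    """Partial reconstruction from the first T coded bit planes:
    bits[bp - 1] is b[k][bp], bp = 1 being the most significant plane,
    and each set bit contributes 2**(bM - bp) to the amplitude."""
    magnitude = sum(bits[bp - 1] << (bM - bp) for bp in range(1, T + 1))
    return -magnitude if sign_sym else magnitude
```

For bM = 3 and bits [1, 0, 1], decoding two planes yields amplitude 4, and decoding all three planes yields 5 (or -5 with sign symbol 1).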
If ê[k][T] ≠ 0, the reconstruction can be further enhanced by an estimation process. Although the bit planes below the current bit plane bp = T have not been coded (yet), they can be estimated based on the Laplacian distribution feature of the frequency elements in SLS coding. This reconstruction enhancement is supposed to be performed in the SLS decoder (i.e., for example, the decoder 800), too. Specifically, the add-on amplitude for the following bit planes (i.e. the bit planes below bit plane T) can be estimated as

ẽ[k][T] = Σ_{bp=T+1}^{bM[s]} Q_{L[s]}[bp] · 2^(bM[s] - bp)
Here, Q_{L[s]}[bp] is the frequency assignment for BPGC coding and is defined as

Q_{L[s]}[bp] = 1 / (1 + 2^(2^(L[s] - bp))),   bp ≤ L[s]
Q_{L[s]}[bp] = 1/2,                           bp > L[s]
and the final reconstructed spectrum coefficient ē[k][T] is obtained as

ē[k][T] = ê[k][T] + ẽ[k][T],   T < L[s]
ē[k][T] = ê[k][T],             otherwise

provided k is a coefficient in scale factor band s. The value ē[k][T] can thus be seen as the residual value that would be obtained from reconstruction using the bits of the bit planes that have been encoded up to the current (Tth) iteration, i.e. from bp = 1 to bp = T.
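The estimation-based enhancement can be sketched as follows; the exact form of the BPGC frequency assignment Q is a reconstruction from the BPGC literature and should be read as an assumption here, as are the boundary conditions at the lazy plane:

```python
def bpgc_q(L, bp):
    """Probability of a one-bit assigned by BPGC to plane bp given the
    lazy bit plane L; planes below the lazy plane are uniform (1/2)."""
    return 1.0 / (1 + 2 ** (2 ** (L - bp))) if bp <= L else 0.5

def enhance(e_hat, L, bM, T):
    """Add the expected amplitude of the not-yet-coded planes
    bp = T+1 .. bM to a non-zero partial reconstruction e_hat."""
    if e_hat == 0 or T >= L:
        return float(e_hat)       # sign unknown, or no BPGC statistics left
    addon = sum(bpgc_q(L, bp) * 2 ** (bM - bp) for bp in range(T + 1, bM + 1))
    return e_hat + addon if e_hat > 0 else e_hat - addon
```

For example, with L[s] = 3 and bM[s] = 3, a partial value of 4 after T = 2 planes is enhanced by Q(3) · 2^0 = 1/3 to roughly 4.33, while a zero partial value is left untouched because its sign is still unknown.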
The distortion for each scale factor band s at terminating bit plane T is calculated as

d[s][T] = 10 · log10( Σ_{k=O[s]}^{O[s+1]-1} (e[k] - ē[k][T])² )

where O[s] is the starting frequency element number of scale factor band s. For each scale factor band, the distortion d[s] is compared with its respective mask M[s] in 1410.
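As an illustration, the per-band distortion and the comparison against the mask can be sketched as:

```python
import math

def band_distortion_db(e, e_bar, O, s):
    """10*log10 of the reconstruction-error energy over the frequency
    elements O[s] .. O[s+1]-1 of scale factor band s."""
    energy = sum((e[k] - e_bar[k]) ** 2 for k in range(O[s], O[s + 1]))
    return 10 * math.log10(energy) if energy > 0 else -math.inf

def frame_transparent(e, e_bar, O, M, bands):
    """True when every checked band's distortion lies below its mask M[s]."""
    return all(band_distortion_db(e, e_bar, O, s) < M[s] for s in bands)
```

With a single band holding residuals [4, 0] reconstructed as [2, 0], the distortion is 10·log10(4) ≈ 6.02 dB, so the frame passes a 7 dB mask but fails a 6 dB mask.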
If, for bp = 1 for example, the distortion is below the mask for all the scale factor bands from 0 to B[1], the encoding for this frame can be stopped and the process continues with the next frame (if any), i.e. the process continues with testing whether there are any more frames in 1411. Otherwise, the last scale factor band whose noise exceeds the corresponding mask is recorded (e.g., sfb 37 in figure 4A). This scale factor band is, in the example of bp = 1, indicated as B[2], which is defined by

d[s][1] < M[s] ∀ B[2] < s < B[1]

(as the minimal value for which this expression is fulfilled).
For bp = 1, since in 1412 sfb is increased to B[1]+1 and sfb < B[1] in 1414 does therefore not hold, the process continues, by increasing bp by 1 in 1415, with 1405, such that the coding continues for scale factor bands 0 to B[2] for bp = 2 (steps 1405 to 1408).
For bp = 2, the distortion is checked again in 1409 after the encoding from 0 to B[2] is done, and if all the encoded scale factor bands have lower distortion than the mask (which is checked in 1410), the coding will stop for the current frame and proceed to the next frame unless the current frame is the final frame (which is checked in 1411). Otherwise, the position of B[3] will be recorded, and the coding will continue in 1412, 1413 and 1414 from B[2]+1 to B[1] for bp = 2, followed by 0 to B[3] for bp = 3 in 1405 to 1408.
The coding process continues in this manner until the condition that all the scale factor bands from 0 to B[bp] for bit plane bp have lower distortion than the mask is fulfilled.
B[bp] is computed by

d[s][bp - 1] < M[s] ∀ B[bp] < s < B[1]

(as the minimal value for which this expression is fulfilled).
The encoding process described above with reference to figure 14, referred to as transparent bit plane encoding process in one embodiment, is illustrated in figure 15.
FIG. 15 shows a bit plane diagram 1500.
In the bit plane diagram 1500, the bit plane number decreases along a first axis 1501 (y-axis) such that a first bit plane (bp = 1) 1503 is at the top.
As an example, a second bit plane 1504 and a third bit plane 1505 are shown in figure 15.
The encoding direction within a bit plane 1503, 1504, 1505 is the direction of a second axis 1502 (x-axis) which is in this example also the direction of increasing scale factor band numbers .
In the first bit plane 1503, the bits of the first bit plane 1503 are scanned (e.g. added to a generated bit stream, which may after generation be, for example, entropy-encoded) according to 1406 of figure 14 (for bp = 1) until a first check point 1506 given by B[1] is reached, at which the distortion is checked according to step 1409. In one embodiment, this means that it is tested whether the residual values which would arise for the scale factor bands from the bits encoded (scanned) so far differ from the true residual values at most by values as given by the mask, i.e., whether the difference lies beneath a pre-defined masking threshold.
If the mask is exceeded for any scale factor band, a second check point 1507 given by B[2] is set and the encoding process continues for the second bit plane 1504 according to 1406 (for bp = 2) until the second check point 1507 is reached. At this point, the distortion is checked again according to 1409 and the encoding process either stops (i.e. no further bits are scanned, the generation of the bit stream is for example finished and the bit stream may now, for example, be entropy-encoded) or, if the mask is exceeded for any scale factor band, a third check point 1508 is set according to B[3] and the remaining bits of the second bit plane 1504 are encoded according to 1413 until B[1] is reached.
Then, the bits of the third bit plane 1505 are encoded until the third check point 1508 is reached and, if the mask is exceeded for any scale factor band, a value B[4] is set and the remaining bits of the third bit plane 1505 are encoded until B[1] is reached. The process continues in this manner until the mask is not exceeded for any scale factor band.
It should be noted that if the encoding does not terminate at a checking point (given by B[bp]) of a bit plane because the mask is exceeded for at least one scale factor band, the bits of the remaining scale factor bands from s = B[bp] + 1 to B[1] in the bit plane bp are encoded before the coding of the next bit plane (bp + 1) starts. This is to satisfy the condition that the encoded format can be further enhanced to lossless.
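The overall checkpoint scheme of figure 14 can be sketched, at the control-flow level, as follows; code_band and band_ok are hypothetical helpers standing in for the bit plane coding (1406/1413) and the distortion check against the mask (1409/1410):

```python
def transparent_encode(B1, code_band, band_ok, max_planes):
    """Control-flow sketch of the checkpoint scheme: B1 is B[1], the
    last non-LEMC band; code_band(bp, s) emits the bits of band s in
    plane bp; band_ok(s) tells whether d[s] is below the mask M[s].
    Returns the number of bit planes coded for this frame."""
    B = B1                                  # checkpoint B[bp] of the current plane
    for bp in range(1, max_planes + 1):
        for s in range(B + 1):              # code up to the checkpoint
            code_band(bp, s)
        worst = None                        # last coded band still above the mask
        for s in range(B + 1):
            if not band_ok(s):
                worst = s
        if worst is None:
            return bp                       # transparent quality reached: stop
        for s in range(B + 1, B1 + 1):      # finish the plane so the stream
            code_band(bp, s)                # can still be topped up to lossless
        B = worst                           # becomes the next checkpoint B[bp + 1]
    return max_planes
```

In a toy run where band 1 needs two coded planes and all other bands only one, encoding stops after the second plane, and every band up to the first checkpoint has been coded in the first plane, as the lossless top-up condition requires.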
According to the process described above, the encoding stops when the transparent quality can just be obtained. This may be achieved without making any change of a standard decoder (e.g. a standard SLS decoder) necessary.
In one embodiment, the encoding is carried out all the way to lossless (i.e. all bit planes are scanned) even if the checked distortion is below the mask for all the scale factor bands. In this case, the bit rate at which this condition is satisfied may be recorded and coded in meta data of the generated bit stream. This bit rate indicates the transparent bit rate for enhancement (e.g. SLS enhancement).
In other words, according to one embodiment, a method for generating encoded audio data is provided which includes adding an indication to the audio data specifying at least one part of the encoded audio data which does not have to be decoded for the decoded audio data to meet a pre-determined quality level. In other words, it may for example be indicated in an audio file which parts of the audio file can be skipped in decoding to achieve a pre-determined quality level, e.g. to achieve transparent quality. On the other hand, it may be indicated which parts should be decoded to achieve the pre-determined quality level. The indication may be a marking in the file from which on (or until which) bits included in the file may be skipped in the decoding without falling below the pre-determined quality level.
For example, the indication specifies a bit-rate and the at least one part of the encoded audio data does not have to be decoded for the decoded audio to have the specified bit-rate. For example, at the specified bit-rate, the decoded audio data meets the pre-determined quality level.
In one embodiment, the pre-determined quality level is based on a human hearing characteristic. The pre-determined quality level may for example be based on a pre-defined human hearing threshold.
In one embodiment, the pre-determined quality level is defined such that the noise introduced to the decoded audio data by leaving out the specified at least one part is below a pre-defined human hearing threshold.
In one embodiment, the encoded audio data includes a bit-stream, the bit-stream including a plurality of bits, and the indication specifies at least one set of bits of the bit-stream which does not have to be decoded for the decoded audio data to meet a pre-determined quality level. The set of bits is for example a continuous sub-bit-stream of the bit-stream.
The bit-stream may include a first part and a second part, the indication may specify which bits of the bit-stream belong to the first part and which bits of the bit-stream belong to the second part, and the decoded audio data meets the pre-determined quality level if the first part is decoded and the second part is not decoded.
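As an illustration, skipping the indicated second part may be sketched as a simple truncation; the byte-level arithmetic below is a simplification, since an actual SLS truncator operates frame by frame on the enhancement layer:

```python
def truncate_to_indication(stream: bytes, indicated_kbps: float,
                           duration_s: float) -> bytes:
    """Keep only the part of the bit-stream that, per the indication in
    the meta data, has to be decoded to meet the quality level; the
    remainder may be skipped (hypothetical byte-level truncation)."""
    keep_bytes = int(indicated_kbps * 1000 * duration_s / 8)
    return stream[:keep_bytes]
```

For a 100-byte stream of one second, an indicated rate of 0.4 kbps keeps the first 50 bytes, while an indication above the stream's own rate leaves it untouched.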
In the following, a method for smart enhancing bit rate estimation is described.
The bit rate needed for smart enhancing roughly depends on two parameters which can be extracted from the original signal and the lossy coded signal. In one embodiment, a bit rate estimation model is established, which allows estimation of the enhancing bit rate necessary for a low quality perceptual audio input to achieve transparent quality without actual encoding. This bit rate can for example be used for SLS lossless format to perform a smart truncation as illustrated in figure 13.
If one considers a matrix representation of the smart enhancing process as shown in figure 15, its depth (i.e. the number of bit planes that are at least partially scanned) can be estimated by the total residual energy (or noise, in the case of perceptual coding) to mask difference, and the length of the matrix can be estimated by the percentage of low energy scale factor bands (position of B[1]). Specifically, the noise to mask difference D[s] for each scale factor band s in terms of dB can be simply computed as

D[s] = 10 · log10( E[s] / M[s] )
where E[s], the residual signal energy for scale factor band s, may be computed as

E[s] = Σ_{k=O[s]}^{O[s+1]-1} e[k]²
where e[k] is the residual coefficient as defined in equation (1). Supposing that there is a total of F frames in an excerpt, a first parameter for bit rate estimation, D_T, can be calculated as

D_T = ( Σ_{f=0}^{F-1} Σ_{s=0}^{S-1} D[s] ) / 1000
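For illustration, the computation of the first parameter can be sketched as:

```python
import math

def noise_to_mask_db(E, M):
    """D[s] = 10*log10(E[s] / M[s]) for every scale factor band."""
    return [10 * math.log10(e / m) for e, m in zip(E, M)]

def dt_parameter(frames):
    """First estimation parameter: the D[s] values summed over all S
    bands of all F frames, scaled by 1/1000; `frames` is a list of
    (E, M) pairs, one pair of per-band lists per frame."""
    return sum(sum(noise_to_mask_db(E, M)) for E, M in frames) / 1000.0
```

A two-frame toy excerpt with one band per frame, energies [100, 10] against masks [10, 10], gives D values of 10 dB and 0 dB, hence D_T = 0.01.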
A second parameter, which corresponds to the length of the enhancing matrix, is the percentage of non-low energy scale factor bands in each frame. Training on a large set of data shows that a scale factor band may be treated as "low energy" if the respective residual energy is lower than 20 dB. Therefore, if N_L denotes the total number of scale factor bands with E[s] > 20 dB for an excerpt, the second parameter P_N can be computed as

P_N = N_L / (F · S)
The total smart enhancing bit rate R_s can thus be estimated as

R_s = ω_1 · D_T · P_N + ω_2

where ω_1 and ω_2 are modulation constants. In one embodiment, by substituting two sets of experimental data for R_s and D_T · P_N, it is obtained that

ω_1 = 0.5, ω_2 = 132.2

may for example be used in one embodiment.
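The estimation can be sketched as follows; both the normalisation of P_N by the F·S bands of the excerpt and the way the two parameters are combined (a weighted product plus an offset) are assumed readings, while the constants 0.5 and 132.2 are the trained values given above:

```python
def pn_parameter(n_nonlow, F, S):
    """Second parameter: share of non-low-energy bands (E[s] > 20 dB),
    assuming normalisation by the F*S bands of the excerpt."""
    return n_nonlow / (F * S)

def estimate_transparent_bitrate(D_T, P_N, w1=0.5, w2=132.2):
    """Estimated smart enhancing bit rate in kbps, using the assumed
    linear model R_s = w1 * D_T * P_N + w2."""
    return w1 * D_T * P_N + w2
```

For example, an excerpt of 10 frames with 6 bands each and 30 non-low-energy bands yields P_N = 0.5; with D_T = 100 the model predicts roughly 157.2 kbps.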
In other words, in one embodiment, a method for determining a bit-rate of an encoded audio signal generated from an original audio signal is provided including determining, for a representation of the original audio signal as a core audio signal portion and a residual audio signal portion, a measure of the residual audio signal portion; and determining the bit-rate based on a comparison of the measure of the residual audio signal portion with a pre-defined residual threshold, wherein the pre-defined residual threshold is based on a predefined quality level.
In other words, in one embodiment, the bit-rate of audio data that is necessary to, for example, achieve a certain quality level, such as a transparent quality level, is estimated, for example based on the way such audio data may be generated, as described above with reference to figure 15. The bit-rate may be estimated based on the number of bit planes that are presumably at least partially encoded and the number of bits (e.g. per frame) that is encoded from each bit plane.
For example, the measure of the residual audio signal is an energy measure of the residual audio signal portion.
In one embodiment, the encoded audio signal is generated such that it fulfills the pre-defined quality level.
For example, the pre-determined quality level is based on a human hearing characteristic. The pre-determined quality level may for example be based on a pre-defined human hearing threshold.
In one embodiment, the bit-rate is further determined based on the number of residual audio signal values of the residual audio signal portion that fulfill a pre-determined criterion.
For example, the pre-determined criterion is based on a pre- defined residual value energy threshold.
In one embodiment, the bit-rate is further determined based on the number of residual audio signal values of the residual audio signal portion whose energy is beneath the residual value energy threshold.
From an evaluation of one embodiment using the standard MPEG-4 audio test sequences, including 15 stereo music files (of different genres) sampled at 48 kHz, 16 bits/sample, it can be seen that:
• For different test items, the bit rate required for transparent quality varies.
• For most of the items, transparent quality can be achieved after the 2nd bit plane is coded.
• The average transparent bit rate for all the 15 items is 205.2 kbps. The average bit rate saving compared to the standard 256 kbps is 19.8%, with a maximum bit rate saving of 49.6%.
• For certain test sequences, the bit rate required for transparent quality is relatively low. This may be caused by:
o the dynamic range of the signal energy being low;
o the signal having a low energy level in the high frequency domain;
o the excerpt containing a certain period of silence (very low energy) frames.
From a subjective test (e.g. with 5% confidence level) it can be seen that, for most of the items, smart enhancing brings a similar and near-transparent quality as the standard 256 kbps does, with a considerable amount of bit rate saving. For those audio excerpts which require more than 256 kbps, apparent improvements are achieved.
From a comparison between the experimental smart enhancing bit rate and the estimated transparent bit rate it can be seen that the smart enhancing bit rate needed for transparent quality can be well estimated by using the two parameters extracted from the original and the AAC coded audio data without real encoding process. This enables the smart truncation function for lossless SLS formats.
For any method provided according to an embodiment of the invention, a computer readable medium may be provided which has computer instructions recorded thereon, which, when executed by a computer, make the computer perform the respective method.
The following documents are cited in the specification:
[1] Scalable Lossless Coding (SLS), ISO/IEC 14496-3 : 2005/Amd 3, 2006.
[2] R. Yu, S. Rahardja, X. Lin, and C. C. Koh, "A fine granular scalable to lossless audio coder," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1352-1363, Jul. 2006.

Claims

1. A method for encoding an audio signal comprising a core audio signal portion and a residual audio signal portion, the method comprising: selecting, from the residual audio signal portion, a first candidate residual audio signal portion and a second candidate residual audio signal portion; comparing the first candidate residual audio signal portion with the second candidate residual audio signal portion; and, depending on the result of the comparison, encoding the audio signal using the second candidate residual audio signal portion.
2. The method according to claim 1, wherein the core audio signal portion comprises a plurality of core audio signal values and wherein the residual audio signal portion comprises a plurality of residual audio signal values.
3. The method according to claim 1 or 2, wherein the first candidate residual audio signal portion is the residual audio signal portion.
4. The method according to any one of claims 1 to 3, wherein the first candidate residual audio signal portion is different from the second candidate residual audio signal portion.
5. The method according to any one of claims 1 to 3, wherein depending on the result of the comparison, the audio signal is encoded using the first candidate residual audio signal portion or a pre-defined process is performed.
6. The method according to claim 5, wherein the pre-defined process comprises selecting a third candidate residual audio signal portion different from the second candidate residual audio signal portion and encoding the audio signal using the third candidate residual audio signal portion.
7. The method according to any one of claims 1 to 6, wherein comparing the first candidate residual audio signal portion with the second candidate residual audio signal portion comprises checking whether the difference between at least one first residual value reconstructed from the first candidate residual audio signal portion according to a pre-determined reconstruction scheme and at least one second residual value reconstructed from the second candidate residual audio signal portion according to the pre-determined reconstruction scheme is lower than a pre-defined threshold.
8. The method according to claim 7, wherein the pre-defined threshold is based on a human hearing perception threshold.
9. The method according to claim 7, wherein the pre-defined threshold is based on a human hearing mask.
10. The method according to any one of claims 1 to 9, wherein encoding the audio signal comprises generating a bit stream including the second candidate residual audio signal portion.
11. The method according to any one of claims 1 to 10, wherein the residual audio signal portion comprises a plurality of residual audio signal values, and wherein selecting the second candidate residual audio signal portion is performed based on a bit word representation of each residual audio signal value and based on an order of the residual audio signal values.
12. The method according to claim 11, wherein selecting the second candidate residual audio signal portion comprises determining a minimum bit significance level; determining a border signal value position; including all bits of the bit word representations of the residual audio signal values that have, in their respective bit word, a higher bit significance level than the minimum bit significance level; and including all bits of the bit word representations of the residual audio signal values that have, in their respective bit word, the minimum bit significance level and that are part of a bit word representation of a signal value that has a position which fulfills a pre-defined criterion with regard to the border signal value position.
13. The method according to claim 12, wherein the pre-defined criterion is one of the position being below the border signal value position; the position being below or equal to the border signal value position; the position being above the border signal value position; and the position being above or equal to the border signal value position.
14. The method according to claim 12 or 13, wherein the minimum bit significance level is determined based on a comparison of the first candidate residual audio signal portion with a fourth candidate residual audio signal portion or is a pre-defined minimum bit significance level.
15. The method according to any one of claims 12 to 14, wherein the border signal value position is determined based on a comparison of the first candidate residual audio signal portion with a fifth candidate residual audio signal portion or is a pre-defined border signal value position.
16. The method according to any one of claims 11 to 15, wherein each residual audio signal value corresponds to at least one frequency.
17. The method according to claim 16, wherein each residual audio signal value corresponds to at least one scale factor band.
18. An encoder for encoding an audio signal comprising a core audio signal portion and a residual audio signal portion, the encoder comprising a selecting circuit configured to select, from the residual audio signal portion, a first candidate residual audio signal portion and a second candidate residual audio signal portion; a comparing circuit configured to compare the first candidate residual audio signal portion with the second candidate residual audio signal portion; and an encoding circuit configured to, depending on the result of the comparison, encode the audio signal using the second candidate residual audio signal portion.
19. A method for generating encoded audio data comprising including an indication into the audio data specifying at least one part of the encoded audio data which does not have to be decoded for the decoded audio data to meet a predetermined quality level.
20. The method according to claim 19, wherein the indication specifies a bit-rate and the at least one part of the encoded audio data does not have to be decoded for the decoded audio to have the specified bit-rate.
21. The method according to claim 20, wherein at the specified bit-rate, the decoded audio data meets the predetermined quality level.
22. The method according to any one of claims 19 to 21, wherein the pre-determined quality level is based on a human hearing characteristic.
23. The method according to claim 22, wherein the predetermined quality level is based on a pre-defined human hearing threshold.
24. The method according to any one of claims 19 to 23, wherein the pre-determined quality level is defined such that the noise introduced to the decoded audio data by leaving out the specified at least one part is below a pre-defined human hearing threshold.
25. The method according to any one of claims 19 to 24, wherein the encoded audio data comprises a bit-stream, the bit-stream comprising a plurality of bits, and the indication specifies at least one set of bits of the bit-stream which does not have to be decoded for the decoded audio data to meet a pre-determined quality level.
26. The method according to claim 25, wherein the set of bits is a continuous sub-bit-stream of the bit-stream.
27. The method according to claim 25 or 26, wherein the bit-stream includes a first part and a second part, the indication specifies which bits of the bit-stream belong to the first part and which bits of the bit-stream belong to the second part, and the decoded audio data meets the predetermined quality level if the first part is decoded and the second part is not decoded.
28. A device for generating encoded audio data configured to include an indication into the audio data specifying at least one part of the encoded audio data which does not have to be decoded for the decoded audio data to meet a pre-determined quality level.
29. A method for determining a bit-rate of an encoded audio signal generated from an original audio signal comprising determining, for a representation of the original audio signal as a core audio signal portion and a residual audio signal portion, a measure of the residual audio signal portion; and determining the bit-rate based on a comparison of the measure of the residual audio signal portion with a pre-defined residual threshold, wherein the pre-defined residual threshold is based on a pre-defined quality level.
30. The method according to claim 29, wherein the measure of the residual audio signal is an energy measure of the residual audio signal portion.
31. The method according to claim 29 or 30, wherein the encoded audio signal is generated such that it fulfills the pre-defined quality level.
32. The method according to claim 31, wherein the predetermined quality level is based on a human hearing characteristic.
33. The method according to claim 32, wherein the predetermined quality level is based on a pre-defined human hearing threshold.
34. The method according to any one of claims 29 to 33, wherein the bit-rate is further determined based on the number of residual audio signal values of the residual audio signal portion that fulfill a pre-determined criterion.
35. The method according to claim 34, wherein the predetermined criterion is based on a pre-defined residual value energy threshold.
36. The method according to claim 35, wherein the bit-rate is further determined based on the number of residual audio signal values of the residual audio signal portion whose energy is beneath the residual value energy threshold.
37. A device for determining a bit-rate of an encoded audio signal generated from an original audio signal comprising a first determining circuit configured to determine, for a representation of the original audio signal as a core audio signal portion and a residual audio signal portion, a measure of the residual audio signal portion; and a second determining circuit configured to determine the bit-rate based on a comparison of the measure of the residual audio signal portion with a pre-defined residual threshold, wherein the pre-defined residual threshold is based on a pre-defined quality level.
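The bit selection described in claims 7, 12, and 13 can be illustrated with a minimal sketch. All names here are illustrative assumptions, not terminology from the patent: residual values are treated as signed integers, the "reconstruction scheme" is taken to be the identity, and one of the four border criteria of claim 13 (position below the border) is used. Bits with a significance level above the minimum are always kept; bits at the minimum level are kept only for values positioned below the border.

```python
def truncate_residual(values, min_level, border_pos):
    """Build a candidate residual portion per the scheme of claim 12:
    keep all bit planes above min_level for every value, and keep the
    min_level plane only for values whose position is below border_pos
    (one of the criteria named in claim 13)."""
    truncated = []
    for pos, v in enumerate(values):
        sign = -1 if v < 0 else 1
        mag = abs(v)
        # positions below the border keep one extra bit plane
        lowest_kept = min_level if pos < border_pos else min_level + 1
        mag &= ~((1 << lowest_kept) - 1)  # zero out discarded low planes
        truncated.append(sign * mag)
    return truncated

def within_threshold(first, second, threshold):
    """Claim 7's check, value by value, assuming identity reconstruction:
    is every reconstruction difference below the pre-defined threshold?"""
    return all(abs(a - b) < threshold for a, b in zip(first, second))

residual = [13, -7, 22, 5, -3, 18]
candidate = truncate_residual(residual, min_level=2, border_pos=3)
print(candidate)                                    # [12, -4, 20, 0, 0, 16]
print(within_threshold(residual, candidate, 8))     # True
```

In an encoder following claims 8 and 9, the `threshold` argument would come from a perceptual model (a hearing threshold or mask) rather than being a fixed constant.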
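The bit-rate determination of claims 29 to 36 can likewise be sketched. The function and parameter names below are assumptions for illustration only: the measure of the residual portion is its energy (claim 30), it is compared against a pre-defined residual threshold derived from the quality level, and the rate is refined by counting residual values whose individual energy meets a per-value energy threshold (claims 34 and 35).

```python
def estimate_bitrate(residual, core_rate_bps, residual_threshold,
                     value_energy_threshold, bits_per_significant_value,
                     frame_rate_hz):
    """Hypothetical bit-rate estimate for one residual portion."""
    energy = sum(v * v for v in residual)  # energy measure (claim 30)
    if energy < residual_threshold:
        # residual is below the quality-derived threshold:
        # the core layer alone meets the pre-defined quality level
        return core_rate_bps
    # count residual values whose energy reaches the per-value threshold
    significant = sum(1 for v in residual if v * v >= value_energy_threshold)
    return core_rate_bps + significant * bits_per_significant_value * frame_rate_hz

rate = estimate_bitrate([12, -4, 20, 0, 0, 16], core_rate_bps=64000,
                        residual_threshold=16, value_energy_threshold=4,
                        bits_per_significant_value=4, frame_rate_hz=43)
print(rate)  # 64688
```

The design point of the claims is that the rate needed by the enhancement layer scales with how much perceptually relevant energy the residual carries, so a small residual collapses to the core-layer rate alone.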
PCT/SG2009/000163 2008-05-07 2009-05-06 Method and device for encoding an audio signal, method and device for generating encoded audio data and method and device for determining a bit-rate of an encoded audio signal WO2009136872A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US5109708P 2008-05-07 2008-05-07
US61/051,097 2008-05-07

Publications (1)

Publication Number Publication Date
WO2009136872A1 true WO2009136872A1 (en) 2009-11-12

Family

ID=41264796

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2009/000163 WO2009136872A1 (en) 2008-05-07 2009-05-06 Method and device for encoding an audio signal, method and device for generating encoded audio data and method and device for determining a bit-rate of an encoded audio signal

Country Status (1)

Country Link
WO (1) WO2009136872A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6154499A (en) * 1996-10-21 2000-11-28 Comsat Corporation Communication systems using nested coder and compatible channel coding
EP1173028A2 (en) * 2000-07-14 2002-01-16 Nokia Mobile Phones Ltd. Scalable encoding of media streams
US20040098267A1 (en) * 2002-08-23 2004-05-20 Ntt Docomo, Inc. Coding device, decoding device, and methods thereof
US20050010396A1 (en) * 2003-07-08 2005-01-13 Industrial Technology Research Institute Scale factor based bit shifting in fine granularity scalability audio coding


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MOFFITT J: "Ogg Vorbis - Open Free Audio - Set Your Media Free", AUDIO/VIDEO, 1 January 2001 (2001-01-01), pages 4, Retrieved from the Internet <URL:http://www.linuxjournal.com/article4416> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011090434A1 (en) * 2010-01-22 2011-07-28 Agency For Science, Technology And Research Method and device for determining a number of bits for encoding an audio signal
EP2526546A1 (en) * 2010-01-22 2012-11-28 Agency For Science, Technology And Research Method and device for determining a number of bits for encoding an audio signal
EP2526546A4 (en) * 2010-01-22 2013-08-28 Agency Science Tech & Res Method and device for determining a number of bits for encoding an audio signal

Similar Documents

Publication Publication Date Title
US9728196B2 (en) Method and apparatus to encode and decode an audio/speech signal
JP5219800B2 (en) Economical volume measurement of coded audio
JP5107916B2 (en) Method and apparatus for extracting important frequency component of audio signal, and encoding and / or decoding method and apparatus for low bit rate audio signal using the same
JP3592473B2 (en) Perceptual noise shaping in the time domain by LPC prediction in the frequency domain
US7343291B2 (en) Multi-pass variable bitrate media encoding
KR100571824B1 (en) Method for encoding/decoding of embedding the ancillary data in MPEG-4 BSAC audio bitstream and apparatus using thereof
US8386271B2 (en) Lossless and near lossless scalable audio codec
US20110075855A1 (en) method and apparatus for processing audio signals
US8457958B2 (en) Audio transcoder using encoder-generated side information to transcode to target bit-rate
US7921007B2 (en) Scalable audio coding
JP2006139306A (en) Method and apparatus for coding multibit code digital sound by subtracting adaptive dither, inserting buried channel bits and filtering the same, and apparatus for decoding and encoding for the method
JP2006011456A (en) Method and device for coding/decoding low-bit rate and computer-readable medium
TWI669704B (en) Apparatus, system and method for mdct m/s stereo with global ild with improved mid/side decision, and related computer program
WO2003063135A1 (en) Audio coding method and apparatus using harmonic extraction
Thiagarajan et al. Analysis of the MPEG-1 Layer III (MP3) algorithm using MATLAB
JP4657570B2 (en) Music information encoding apparatus and method, music information decoding apparatus and method, program, and recording medium
US20080133250A1 (en) Method and Related Device for Improving the Processing of MP3 Decoding and Encoding
JP2000132193A (en) Signal encoding device and method therefor, and signal decoding device and method therefor
US20130197919A1 (en) &#34;method and device for determining a number of bits for encoding an audio signal&#34;
WO2009136872A1 (en) Method and device for encoding an audio signal, method and device for generating encoded audio data and method and device for determining a bit-rate of an encoded audio signal
van Schijndel et al. Adaptive RD optimized hybrid sound coding
Li et al. Fixed quality layered audio based on scalable lossless coding
Ning et al. A bitstream scalable audio coder using a hybrid WLPC-wavelet representation
Nithin et al. Low complexity Bit allocation algorithms for MP3/AAC encoding
JP2005003835A (en) Audio signal encoding system, audio signal encoding method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09742947

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09742947

Country of ref document: EP

Kind code of ref document: A1